Predicting breast cancer — Adrián Vallés
Performed and compared predictive modelling approaches (classification tree, logistic regression, and random forest) to predict benign vs. malignant breast cancers using R for the Data Mining class (BANA 4080).
Multiple regression and logistic regression performed in SPSS to evaluate the relationship between birth rate and abortion rate for males and females.
Performed statistical analysis on a chosen data table and examined relationships among different data fields using IBM SPSS software.
Methodologies: multiple linear regression, logistic regression
IBM SPSS
Performed statistical analysis with multiple and logistic regression in R Studio and SPSS on Gender Inequality ratio data and Employment to Population data, respectively.
This report includes information about:
1. Pre-Processing Variables
a. Treating Missing Values
b. Treating correlated variables
2. Selection of Variables using random forest weights
3. Building model to predict donors and amount expected to be donated.
THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA — ijcsit
This article consolidates the idea that non-random pairing can promote the evolution of cooperation in a non-repeated version of the prisoner's dilemma. The idea is taken from [1], which presents experiments utilizing stochastic simulation. In the following, it is shown how the results from [1] are reproducible by numerical analysis. It is also demonstrated that some unexplained findings in [1] are due to the methods used.
The amount of information in the form of features and variables available to machine learning algorithms is ever increasing. This can lead to classifiers that are prone to overfitting in high dimensions; high-dimensional models do not lend themselves to interpretable results; and the CPU and memory resources necessary to run on high-dimensional datasets severely limit the applications of these approaches.
Variable and feature selection aim to remedy this by finding a subset of features that best captures the information provided.
In this paper we present the general methodology and highlight some specific approaches.
Effect of 3D Parameters on Antifungal Activities of Some Heterocyclic Compounds — IOSR Journals
Quantitative Structure-Activity Relationships (QSAR) of some heterocyclic compounds were studied using some 3D parameters. The QSAR models indicated that Dipole Y, dipole magnitude, Y length, and some indicator parameters are very effective in describing the antifungal activities of these compounds against Candida albicans in the training and external test sets. The multiple regression analysis produced well-predictive, statistically significant, and cross-validated QSAR models which help to explore some expectedly potent compounds.
A Moment Inequality for Overall Decreasing Life Class of Life Distributions w... — inventionjournals
A moment inequality is derived for a system whose life distribution is in the overall decreasing life (ODL) class of life distributions. A new nonparametric test statistic for testing exponentiality against ODL is investigated based on this inequality. The asymptotic normality of the proposed statistic is presented. Pitman's asymptotic efficiency, power, and critical values of this test are calculated to assess its performance. Real examples are given to elucidate the use of the proposed test statistic in reliability analysis. We also propose a test for testing exponentiality versus ODL for right-censored data, and the power estimates of this test are simulated for censored data for some commonly used distributions in reliability. Finally, real data are used as an example for practical problems.
A controversial genetic restoration mechanism has been proposed for the model organism Arabidopsis thaliana. This theory proposes that genetic material from non-parental ancestors is used to restore genetic information that was inadvertently corrupted during reproduction. We evaluate the effectiveness of this strategy by adapting it to an evolutionary algorithm solving two distinct benchmark optimization problems. We compare the performance of the proposed strategy with a number of alternate strategies, including the Mendelian alternative. Included in this comparison are a number of biologically implausible templates that help elucidate likely reasons for the relative performance of the different templates. Results show that the proposed non-Mendelian restoration strategy is highly effective across the range of conditions investigated, significantly outperforming the Mendelian alternative in almost every situation.
Module 05 – Hypothesis Tests Using Two Samples — IlonaThornburg83
Module 05 – Hypothesis Tests Using Two Samples
Class Objectives:
· Identify whether two samples are independent or dependent.
· Compare the testing procedures for two sample tests.
· Test hypotheses about two population parameters.
Module 05 - Part 1
Last week we took one sample to see if it supported our alternative hypothesis. This week we are going to increase to TWO samples and see if there is a significant difference between them.
When would we use this?
· Two samples are __________________________________ if the sample values from one population are not related to or somehow naturally paired or matched with the sample values from the other population.
· Example:
· Two samples are _____________________________ (or consist of ______________________________________) if the sample values are somehow matched, where the matching is based on some inherent relationship.
· Example:
Hint: If the two samples have different sample sizes with no missing data, they must be independent. If the two samples have the same sample size, the samples may or may not be independent.
Put the variables in for each population in the table below.

                                Population 1    Population 2
Population Mean                 ____________    ____________
Population Standard Deviation   ____________    ____________
Population Proportion           ____________    ____________
Sample Size                     ____________    ____________
Sample Mean                     ____________    ____________
Sample Standard Deviation       ____________    ____________
Sample Proportion               ____________    ____________
Note: We are going to approach the problem as if σ1 and σ2 are unknown. This is the most common case and means that we will be using the t test statistic.
· The test statistic is given by the formula below:

t = ((x̄1 − x̄2) − (μ1 − μ2)) / √(s1²/n1 + s2²/n2)

where we assume μ1 − μ2 = 0.
To calculate the degrees of freedom, pick the _______________________ n value and subtract 1.
We will be doing the same steps as before to test the hypothesis (either the critical value test or the p-value test); there are just different formulas.
· The null hypothesis is given as _____________________________.
· The alternative hypothesis will be either ____________________________, ___________________________, or _____________________________.
Example 1. Data Set 26 “Cola Weights and Volumes” in Appendix B includes weights (lb) of the contents of cans of Diet Coke (n = 36, x = 0.78479 lb, s = 0.00439 lb) and of the contents of cans of regular Coke (n = 36, x = 0.81682 lb, s = 0.00751 lb). Use a 0.05 significance level to test the claim that the contents of cans of Diet Coke have weights with a mean that is less than the mean for regular Coke.
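The arithmetic in Example 1 can be checked numerically. The following is a minimal Python sketch (purely illustrative; the class itself works from the worksheet formula and tables) that computes the two-sample t statistic and the conservative degrees of freedom from the summary statistics given in the example:

```python
import math

# Summary statistics from Example 1 (Diet Coke vs. regular Coke, weights in lb)
n1, xbar1, s1 = 36, 0.78479, 0.00439
n2, xbar2, s2 = 36, 0.81682, 0.00751

# Two-sample t statistic, assuming mu1 - mu2 = 0 under H0
t = (xbar1 - xbar2) / math.sqrt(s1**2 / n1 + s2**2 / n2)

# Conservative degrees of freedom: the smaller n value minus 1
df = min(n1, n2) - 1

print(round(t, 2), df)  # t ≈ -22.09 with df = 35
```

The statistic falls far below the one-tailed 0.05 critical value (about −1.69 at df = 35), so the claim that Diet Coke cans have a smaller mean weight is supported.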
Example 2. Researchers from the University of British Columbia conducted trials to investigate the effects of color on creativity. Subjects with a red background were asked to think of creative uses for a brick; other subjects with a blue background were given the same task. Responses were scored by a panel of judges and results from scores of creativity are given below. Higher scores correspond to more creativity. The researchers make the claim that “blue enhances performance on a creative task.” Use a 0.05 significance level to test the claim that blue enhances perform ...
Network analysis of cancer metabolism: A novel route to precision medicine — Varshit Dusad
Master's project presentation for MRes Systems and Synthetic Biology 2017-18, Imperial College London.
Study of cancer metabolism using constraint-based modeling and graph theory.
ON THE PREDICTION ACCURACIES OF THREE MOST KNOWN REGULARIZERS: RIDGE REGRESS... — ijaia
The work in this paper shows intensive empirical experiments using 13 datasets to understand the regularization effectiveness of ridge regression, the lasso estimate, and elastic net regularization methods. The study offers a deep understanding of how the datasets affect the prediction accuracy of each regularization method for a given problem, given the diversity of the datasets used. The results show that datasets play crucial roles in the performance of the regularization method and that prediction accuracy depends heavily on the nature of the sampled datasets.
Penalized Regressions with Different Tuning Parameter Choosing Criteria and t... — CSCJournals
Recently a great deal of attention has been paid to modern regression methods such as penalized regressions which perform variable selection and coefficient estimation simultaneously, thereby providing new approaches to analyze complex data of high dimension. The choice of the tuning parameter is vital in penalized regression. In this paper, we studied the effect of different tuning parameter choosing criteria on the performances of some well-known penalization methods including ridge, lasso, and elastic net regressions. Specifically, we investigated the widely used information criteria in regression models such as Bayesian information criterion (BIC), Akaike’s information criterion (AIC), and AIC correction (AICc) in various simulation scenarios and a real data example in economic modeling. We found that predictive performance of models selected by different information criteria is heavily dependent on the properties of a data set. It is hard to find a universal best tuning parameter choosing criterion and a best penalty function for all cases. The results in this research provide reference for the choices of different criteria for tuning parameter in penalized regressions for practitioners, which also expands the nascent field of applications of penalized regressions.
Using Artificial Neural Networks to Detect Multiple Cancers from a Blood Test — StevenQu1
Research paper drafted during my two-year internship with Oak Ridge National Laboratory, illustrating the potential of artificial intelligence in cancer research.
A New CPXR-Based Logistic Regression Method and Clinical Prognostic Modeling ... — Vahid Taslimitehrani
Presented at 15th International Conference on BioInformatics and BioEngineering (BIBE2014)
Prognostic modeling is central to medicine, as it is often used to predict patients' outcome and response to treatments and to identify important medical risk factors. Logistic regression is one of the most used approaches for clinical prediction modeling. Traumatic brain injury (TBI) is an important public health issue and a leading cause of death and disability worldwide. In this study, we adapt CPXR (Contrast Pattern Aided Regression, a recently introduced regression method) to develop a new logistic regression method called CPXR(Log) for general binary outcome prediction (including prognostic modeling), and we use the method to carry out prognostic modeling for TBI using admission-time data. The models produced by CPXR(Log) achieved AUC as high as 0.93 and specificity as high as 0.97, much better than those reported by previous studies. Our method produced interpretable prediction models for diverse patient groups for TBI, which show that different kinds of patients should be evaluated differently for TBI outcome prediction and that the odds ratios of some predictor variables differ significantly from those given by previous studies; such results can be valuable to physicians.
Learn SQL from Basic Queries to Advanced Queries — manishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Analysis insight about a Flyball dog competition team's performance — roli9797
Insight from my analysis of a Flyball dog competition team's last-year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
State of Artificial Intelligence Report 2023 — kuntobimo2016
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
Unleashing the Power of Data: Choosing a Trusted Analytics Platform — Enterprise Wired
In this guide, we'll explore the key considerations and features to look for when choosing a trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
Global Situational Awareness of A.I. and Where It's Headed — vikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we're lucky, we'll be in an all-out race with the CCP; if we're unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup and is sponsored by Zilliz, maintainers of Milvus.
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
Enhanced Enterprise Intelligence with your personal AI Data Copilot — GetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
1 Linear Classification Models (Part I)
1.1 Classification Imbalance
The hepatic injury status response variable is a 3-class categorical variable with possible values "None", "Mild Severity", and "Severe". The distribution of this response variable is shown in Figure 1. As we can see, the distribution is highly imbalanced, which would pose a serious problem in model training. There are several ways to handle this problem; one of the most popular is to use sampling techniques to reconstruct a balanced training dataset. Ling and Li (1998)¹ provide an approach to up-sampling in which cases from the minority classes are sampled with replacement until each class has approximately the same number of observations. The reason I prefer up-sampling over down-sampling in this context is that the number of "Severe" observations is so limited that a down-sampled training dataset would be too small for the model to be well trained². Therefore, the training dataset is created as follows:
1) First, set the random seed to zj2160, and randomly assign every sample in the whole dataset to either the training dataset or the test dataset; the probability of being assigned to the training dataset is 80%.
2) Next, use the function upSample in the caret library to reconstruct the training dataset so that the new training dataset is balanced.
Do I need to use the up-sampling method to reconstruct the test dataset? The answer is no: while the training dataset is sampled to be balanced, the test dataset should be sampled to be consistent with the state of nature and should reflect the imbalance, so that honest estimates of future performance can be computed.
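The report performs this step with caret's upSample in R. As an illustration of the same idea, here is a small self-contained Python sketch (the dataset and helper names are made up): minority classes are resampled with replacement until every class matches the largest one.

```python
import random
from collections import Counter

random.seed(0)  # the report sets the seed to zj2160 in R; any seed works here

def up_sample(rows, label_of):
    """Sample minority classes with replacement until every class
    matches the size of the largest class (Ling and Li, 1998)."""
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)  # keep all original cases
        # resample the shortfall with replacement (k = 0 for the largest class)
        balanced.extend(random.choices(group, k=target - len(group)))
    return balanced

# Toy imbalanced dataset mimicking the hepatic injury classes
train = [("None", i) for i in range(50)] + \
        [("Mild", i) for i in range(20)] + \
        [("Severe", i) for i in range(5)]
balanced = up_sample(train, label_of=lambda row: row[0])
print(Counter(label for label, _ in balanced))  # 50 of each class
```

Note that only the training split is passed through `up_sample`; the test split keeps its natural imbalance, as argued above.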
1.2 Classification Statistic
There are 3 usual classification statistics: AUC (area under the ROC curve), Kappa, and Accuracy. For a 2-class classification problem, we usually use AUC as the classification statistic. However, in this context the response variable contains 3 classes. Two solutions are presented as follows:
1) Use Kappa or Accuracy as the classification statistic, since AUC is only appropriate for 2-class classification problems. However, some models are natively unsuitable for multi-class classification, such as the logistic regression model (although multinomial logistic regression can compensate).
¹ Ling C, Li C (1998). "Data Mining for Direct Marketing: Problems and Solutions." In "Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining," pp. 73–79.
² I have tested both the down-sampling and up-sampling methods, and the comparison can be found in a later part. In fact, the prediction performance of most models becomes better when up-sampling is substituted for down-sampling. The reason may be that, with so few "Severe" samples, down-sampling produces a small training dataset, so the model would not be well trained.
2) Still use Kappa or Accuracy as the final classification statistic, but build k sub-models for all k classes. More specifically, create k binary variables whose value is 1 if the sample belongs to the corresponding class and 0 otherwise. In this context, the binary variables are as follows:

None_i = 1 if sample i is in the "None" category, 0 otherwise
Mild_i = 1 if sample i is in the "Mild" category, 0 otherwise
Severe_i = 1 if sample i is in the "Severe" category, 0 otherwise
Then I would train 3 separate models using these 3 response variables. When selecting the tuning parameter, I would use AUC to select the optimal value. When predicting, I would combine the 3 probability predictions using the softmax transformation (Bridle 1990³), which is defined as

p*_k = exp(p_k) / Σ_{l=1}^{3} exp(p_l)

where p_k is the probability prediction for the k-th class and p*_k is the transformed value between 0 and 1. The final prediction is the class with the largest p*_k.
I decided to use the latter approach since it can accommodate all the models. The final classification statistic when measuring prediction performance is Kappa, and I also use Accuracy as a reference.
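The one-vs-rest combination described above can be sketched in a few lines. This is an illustrative Python version (the report itself works in R with caret), with made-up probability outputs standing in for the three fitted sub-models:

```python
import math

def softmax(scores):
    """Bridle (1990) softmax: map raw scores to values in (0, 1) that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_class(p_none, p_mild, p_severe):
    """Combine the three one-vs-rest probability predictions and
    return the class with the largest transformed value."""
    classes = ["None", "Mild", "Severe"]
    transformed = softmax([p_none, p_mild, p_severe])
    return classes[transformed.index(max(transformed))]

# Example: the "Severe" sub-model is most confident for this sample
print(predict_class(0.20, 0.35, 0.90))  # -> Severe
```

Since softmax is monotone, the class with the largest raw probability always wins; the transformation simply rescales the three sub-model outputs into a single comparable distribution.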
1.3 Comparison between Models Based Separately on Bio and Chem
There are in total 4 linear classification models discussed in Chapter 12: Logistic
Regression, Linear Discriminant Analysis, Partial Least Squares Discriminant
Analysis and Penalized Models. The results are shown in Table 1 and Table 2. As we can
see, when we only use biological predictors, the Penalized Model yields the best
performance, with a Kappa of 0.13 with up-sampling and 0.193 with down-sampling. When
we only use chemical fingerprint predictors, Partial Least Squares Discriminant Analysis
(PLSDA) yields the best performance, with a Kappa of 0.277 with up-sampling and 0.246
with down-sampling. Based on these results, the chemical fingerprint predictors appear to
contain the most information about hepatic toxicity, a point further demonstrated when
we consider nonlinear models.
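Since Kappa is the headline statistic throughout this report, a small self-contained sketch of how it is computed (observed agreement versus chance agreement) may be useful; the toy label vectors below are illustrative only:

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: kappa = (p_o - p_e) / (1 - p_e), where p_o is the
    observed agreement and p_e the agreement expected by chance."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    classes = set(y_true) | set(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in classes) / n ** 2
    return (p_o - p_e) / (1 - p_e)

y_true = ["None", "None", "Mild", "Severe", "Mild", "None"]
y_pred = ["None", "Mild", "Mild", "Severe", "Mild", "None"]
print(round(cohen_kappa(y_true, y_pred), 3))  # -> 0.739
```

A Kappa of 0 means no better than chance, and 1 means perfect agreement, which is why it is a fairer statistic than raw Accuracy on imbalanced classes such as these.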
1.4 Top Predictors
3 Bridle J (1990). "Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition." In "Neurocomputing: Algorithms, Architectures and Applications," pp. 227–236. Springer-Verlag.
For the optimal model for biological predictors, the Penalized Model (up-sampling), the
top 5 important predictors are as follows:
1) When predicting whether it’s “None”, the top 5 important variables are Z130, Z118,
Z98, Z48, Z64. See Figure 2 for details.
2) When predicting whether it’s “Mild”, the top 5 important variables are Z20, Z38, Z99,
Z53, Z79. See Figure 3 for details.
3) When predicting whether it’s “Severe”, the top 5 important variables are Z100, Z83,
Z102, Z15, Z59. See Figure 4 for details.
For the optimal model for chemical fingerprint predictors, PLSDA (up-sampling),
the top 5 important predictors are as follows:
1) When predicting whether it’s “None”, the top 5 important variables are X134, X188,
X154, X83, X72. See Figure 5 for details.
2) When predicting whether it’s “Mild”, the top 5 important variables are X140, X147,
X31, X134, X67. See Figure 6 for details.
3) When predicting whether it’s “Severe”, the top 5 important variables are X72, X113,
X44, X136, X81. See Figure 7 for details.
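The importance rankings above presumably come from caret's variable-importance machinery in R. As a generic, model-agnostic alternative, permutation importance can be sketched in a few lines of Python; the toy model and data here are purely illustrative:

```python
import random

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Score each predictor by how much shuffling its column hurts
    accuracy (a model-agnostic stand-in for caret's varImp)."""
    rng = random.Random(seed)
    def accuracy(X_):
        return sum(predict(row) == yi for row, yi in zip(X_, y)) / len(y)
    base = accuracy(X)
    scores = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            X_perm = [row[:j] + [c] + row[j + 1:] for row, c in zip(X, col)]
            drops.append(base - accuracy(X_perm))
        scores.append(sum(drops) / n_repeats)
    return scores

# Toy example: the model only uses feature 0, so feature 1 scores 0.
X = [[1, 7], [-1, 7], [2, 7], [-2, 7], [3, 7], [-3, 7]]
y = [1, 0, 1, 0, 1, 0]
predict = lambda row: 1 if row[0] > 0 else 0
imp = permutation_importance(predict, X, y)
print(imp[1])  # -> 0.0 (feature 1 is never used by the model)
```

Predictors whose shuffling barely changes the metric, like feature 1 here, would fall to the bottom of a top-5 list.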
1.5 Comparison between Models Based on Both Bio and Chem
The optimal model using both biological and chemical predictors is PLSDA, which yields a
Kappa of 0.372 with up-sampling and 0.186 with down-sampling. With both sets of
predictors, the PLSDA model performs noticeably better than the models using only one
set of predictors.
The top 5 predictors for the PLSDA model (up-sampling) are as follows:
1) When predicting whether it’s “None”, the top 5 important variables are X134, X154,
Z116, Z149, Z38. See Figure 8 for details.
2) When predicting whether it’s “Mild”, the top 5 important variables are Z116, Z93, X38,
X98, X155. See Figure 9 for details.
3) When predicting whether it’s “Severe”, the top 5 important variables are Z69, Z100,
X72, Z102, Z93. See Figure 10 for details.
Comparing these top 5 lists with the previous results, we can see that for "None" and
"Severe" the top predictors appear to be drawn from both the biological-only and the
chemical-only top-5 lists; for example, for "Severe", Z100 and Z102 are both among the
top 5 predictors in the earlier results. Another interesting observation is that
Z-predictors appear more often in these top-5 lists than X-predictors, which suggests
that the biological predictors also contribute useful information once the two sets are
combined, even though the chemical fingerprint predictors carry the most information on
their own.
1.6 Suggestion
I would recommend using both the biological and the chemical predictors and training a
PLSDA model with up-sampling, which yields a quite accurate prediction. Table 3 shows
that almost all the down-sampling results are worse than the corresponding up-sampling
results, so up-sampling should be used to train the model. Also, among all the linear
classification models, PLSDA outperforms the others with a Kappa of 0.372, which
qualifies as a reasonably good prediction.
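The up-sampling recommended here simply resamples the minority classes with replacement until all class counts match the majority class. A minimal Python sketch (the actual work was presumably done through caret's sampling option in R):

```python
import random
from collections import Counter

def up_sample(X, y, seed=0):
    """Up-sampling: resample each minority class with replacement
    until every class matches the majority class size."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    n_max = max(len(rows) for rows in by_class.values())
    Xb, yb = [], []
    for label, rows in by_class.items():
        resampled = rows + [rng.choice(rows) for _ in range(n_max - len(rows))]
        Xb.extend(resampled)
        yb.extend([label] * n_max)
    return Xb, yb

X = [[1], [2], [3], [4], [5], [6]]
y = ["None", "None", "None", "Mild", "Mild", "Severe"]
Xb, yb = up_sample(X, y)
print(sorted(Counter(yb).items()))  # -> [('Mild', 3), ('None', 3), ('Severe', 3)]
```

Unlike down-sampling, no training rows are discarded, which matches the observation above that down-sampling starves the models of "Severe" examples.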
2 Nonlinear Classification Models (Part II)
2.1 Comparison between Models Based Separately on Bio and Chem
The nonlinear classification models discussed in Chapter 13 are Regularized
Discriminant Analysis (with Quadratic Discriminant Analysis included by setting lambda
to 1), Neural Networks, Averaged Neural Networks, Flexible Discriminant Analysis,
Support Vector Machines, K-Nearest Neighbors and Naïve Bayes. The results are shown in
Table 4 and Table 5. As we can see, when we only use biological predictors, the
Averaged Neural Network (AvNNet) yields the best performance, with a Kappa of 0.368
with up-sampling and 0.119 with down-sampling. When we only use chemical fingerprint
predictors, the Support Vector Machine (SVM) yields the best performance, with a Kappa
of 0.328 with up-sampling and 0.235 with down-sampling.
Compared with the linear classification models, when we only use biological predictors
the nonlinear structure of these models greatly improves classification performance:
the best linear model could only yield a Kappa of 0.13, while almost all the nonlinear
models yield a higher Kappa, the highest being 0.368. When we only use chemical
predictors, however, the nonlinear structure helps, but not as much as with the
biological predictors: the highest Kappa with a nonlinear model is 0.328, versus 0.277
with a linear model.
2.2 Top Predictors
For the optimal model for biological predictors, AvNNet (up-sampling), the top 5
important predictors are as follows:
1) When predicting whether it’s “None”, the top 5 important variables are Z130, Z118,
Z98, Z48, Z64. See Figure 11 for details.
2) When predicting whether it’s “Mild”, the top 5 important variables are Z20, Z38, Z99,
Z53, Z79. See Figure 12 for details.
3) When predicting whether it's "Severe", the top 5 important variables are Z100, Z83,
Z102, Z15, Z59. See Figure 13 for details.
For the optimal model for chemical fingerprint predictors, SVM (up-sampling), the
top 5 important predictors are as follows:
1) When predicting whether it’s “None”, the top 5 important variables are X132, X1, X95,
X133, X120. See Figure 14 for details.
2) When predicting whether it’s “Mild”, the top 5 important variables are X135, X1,
X132, X28, X125. See Figure 15 for details.
3) When predicting whether it’s “Severe”, the top 5 important variables are X144, X145,
X133, X139, X81. See Figure 16 for details.
2.3 Comparison between Models Based on Both Bio and Chem
The optimal model using both biological and chemical predictors is Naïve Bayes, which
yields a Kappa of 0.306 with up-sampling and 0.403 with down-sampling. With both sets
of predictors, the Naïve Bayes model performs slightly better than the models using
only one set of predictors.
The top 5 predictors for the Naïve Bayes model (up-sampling) are as follows:
1) When predicting whether it’s “None”, the top 5 important variables are X132, X1, X95,
X133, Z130. See Figure 17 for details.
2) When predicting whether it’s “Mild”, the top 5 important variables are X135, X1,
X132, X28, X125. See Figure 18 for details.
3) When predicting whether it’s “Severe”, the top 5 important variables are X144, X145,
X133, X139, X81. See Figure 19 for details.
Compared with the previous results, the top 5 important variables are almost identical
to those obtained using only the chemical fingerprint predictors. The only difference
is in predicting "None", where the 5th most important variable is Z130 rather than
X120. This again strongly supports the earlier conclusion that the chemical fingerprint
predictors contain most of the information about hepatic toxicity, since almost all the
important variables are X-predictors (chemical fingerprint predictors).
2.4 Suggestion
I would recommend using both the biological and the chemical predictors and training a
Naïve Bayes model with up-sampling. The nonlinear structure indeed helps to improve
performance over the linear models: with a Kappa of 0.306 with up-sampling and 0.403
with down-sampling, a well-trained Naïve Bayes model outperforms the optimal linear
model. Therefore, I would recommend Naïve Bayes for predicting hepatic toxicity.
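To make the recommended classifier concrete, here is a minimal Gaussian Naïve Bayes in Python: class priors plus per-class, per-feature normal densities, with features assumed independent. The actual models were trained in R with caret, so this sketch and its toy data are purely illustrative:

```python
import math
from collections import defaultdict

class GaussianNB:
    """Minimal Gaussian Naive Bayes classifier."""
    def fit(self, X, y):
        groups = defaultdict(list)
        for xi, yi in zip(X, y):
            groups[yi].append(xi)
        n = len(y)
        self.stats = {}
        for label, rows in groups.items():
            prior = len(rows) / n
            params = []
            for col in zip(*rows):  # one (mean, variance) pair per feature
                mu = sum(col) / len(col)
                var = sum((v - mu) ** 2 for v in col) / len(col) + 1e-9
                params.append((mu, var))
            self.stats[label] = (prior, params)
        return self

    def predict(self, x):
        best, best_lp = None, -math.inf
        for label, (prior, params) in self.stats.items():
            lp = math.log(prior)  # log prior + sum of log normal densities
            for v, (mu, var) in zip(x, params):
                lp += -0.5 * (math.log(2 * math.pi * var) + (v - mu) ** 2 / var)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

clf = GaussianNB().fit([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]],
                       ["Mild", "Mild", "Severe", "Severe"])
print(clf.predict([1.1, 1.0]))  # -> Mild
```

The independence assumption is what lets the model scale to the hundreds of X- and Z-predictors here: each feature contributes one log-density term, with no joint covariance to estimate.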
3 Tree-based Classification Models (Part III)
3.1 CART & Conditional Inference Trees
Random forests were built with both CART trees and conditional inference trees using
the chemistry predictors, with the Kappa statistic as the metric. When comparing
performance in predicting the whole dataset, the CART-based forest (tuning parameter
mtry = 100) has an Accuracy of 0.568 and a Kappa of 0.21, while the conditional
inference forest (mtry = 10) has an Accuracy of 0.534 and a Kappa of 0.0996. Clearly,
the random forest built from CART trees performs better.
3.2 Computation Time Comparison
The output of the computation time is as follows:
> ## Obtain the computation time for each model
> rfCART$times$everything
user system elapsed
492.665 2.582 171.341
> rfcForest$times$everything
user system elapsed
581.095 52.354 169.595
As we can see, the CART-based forest not only performs better but also requires less
CPU time than the conditional inference forest (the elapsed times are similar, but the
user and system times are lower). Therefore, I would prefer CART over conditional
inference trees.
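R's `user`/`system`/`elapsed` split corresponds roughly to CPU time versus wall-clock time. An equivalent measurement in Python (illustrative helper; the original timings came from caret's stored `$times`) might look like:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once, returning (result, cpu_seconds, wall_seconds).
    process_time() is roughly R's user+system; perf_counter() is 'elapsed'."""
    cpu0, wall0 = time.process_time(), time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.process_time() - cpu0, time.perf_counter() - wall0

result, cpu, wall = timed(sum, range(10 ** 6))
print(result)  # -> 499999500000
```

When training is parallelized, CPU time can exceed wall-clock time, which is exactly the pattern in the R output above (user ≈ 492s vs elapsed ≈ 171s).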
3.3 Top Predictors
Figure 20 and Figure 21 show the top 10 important variables for the two models.
More specifically, for CART, the top 10 important variables are: X1, X132, X71, X28, X31,
X29, X147, X30, X11, X6.
For conditional inference tree, the top 10 important variables are: X132, X134, X1, X71,
X35, X95, X139, X38, X98, X160.
The top 10 most important variables differ substantially between CART and conditional
inference trees. Conditional inference trees use statistical hypothesis tests in an
exhaustive search across predictors and their possible split points: for every
candidate split, a test evaluates the association between the split and the response.
The CART model, by contrast, chooses split points with a different objective,
maximizing the reduction in node impurity (for regression, the reduction in squared
error). This difference in objective function may be the reason the two models produce
noticeably different rankings.
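The CART objective can be made concrete: for a classification tree, each candidate split is scored by the decrease in Gini impurity it produces. A small illustrative sketch:

```python
def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_reduction(parent, left, right):
    """Impurity decrease for a candidate split: G(parent) minus the
    size-weighted impurities of the two child nodes."""
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

# A perfect split of a balanced two-class node removes all impurity.
parent = ["None"] * 4 + ["Severe"] * 4
print(gini_reduction(parent, ["None"] * 4, ["Severe"] * 4))  # -> 0.5
```

A conditional inference tree would instead run an association test at each candidate split and pick the most significant one, which is why the two algorithms can rank the same predictors quite differently.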