Parametric Regression Modeling and Nonparametric Tests

1
STA6166 (5842): STATISTICAL METHODS FOR RESEARCH – 1
Project Assignment
Kalaivanan Murthy
Problem 1: Parametric Multiple Regression Model
Steps:
i. Develop a fuller model with all predictors, with powers 1 and without interaction terms.
ii. Develop a correlation matrix, and observe for highly correlated terms.
iii. Observe results of above steps and remove the predictors with high p-value and highly correlated predictors.
While removing highly correlated predictors, remove the one which has high p-value and keep the other.
iv. Develop a reduced model and add interaction terms based on your observation. Make sure the number of
predictors including interaction does not exceed one-tenth of number of observations.
v. Check the model for four assumptions.
vi. If assumptions are violated do Power Transformation. If model fit is major assumption violated, transform the
predictor. If normality, independence or homogeneity is violated transform response.
vii. If the power is close to 1 or violates assumptions, no transformation is necessary as such.
viii. Look for higher R2 and to arrive at a precise model, and R2
-adj and Akaike Inference Coefficient for an
effective model.
Illustrated:
i. A fuller model:
R2
=0.46 R2
adj=0.42 AIC=952.16
ii. Correlation Matrix
iii. Assumptions for a Reduced Model ( .~. –Weight–Penetration–Age+SVI.Score) R2
adj=0.49
Normality seems to be violated.
Independence is OK.
Homogeneity of variance seems to be violated
in some regions.
Fit of model is OK.
vii. Estimated Power = 0.1256. It enhances normality but slightly violated independence. It is accepted and transformation
is done.
Final Model:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -15.24264 40.53932 -0.376 0.707814
Cancer.Volume 2.03225 0.59359 3.424 0.000936
Weight 0.01132 0.07395 0.153 0.878708
Age -0.53721 0.47588 -1.129 0.261977
BPH 1.29831 1.20168 1.08 0.282878
SVI 19.60957 10.89184 1.8 0.075187
Penetration 1.09877 1.33377 0.824 0.412253
Score 7.05922 5.19452 1.359 0.177589
PSA Cancer.VolumeWeight Age BPH SVI PenetrationScore
PSA 1 0.624 0.026 0.017 -0.016 0.529 0.551 0.43
Cancer.Volume 0.624 1 0.005 0.039 -0.133 0.582 0.693 0.481
Weight 0.026 0.005 1 0.164 0.322 -0.002 0.002 -0.024
Age 0.017 0.039 0.164 1 0.366 0.118 0.1 0.226
BPH -0.016 -0.133 0.322 0.366 1 -0.12 -0.083 0.027
SVI 0.529 0.582 -0.002 0.118 -0.12 1 0.68 0.429
Penetration 0.551 0.693 0.002 0.1 -0.083 0.68 1 0.462
Score 0.43 0.481 -0.024 0.226 0.027 0.429 0.462 1

2
ln (PSA) = -1.156 + 0.092*Cancer.Volume - 0.03*Penetration + 0.489*Score - 0.919*SVI +
0.268*Score*SVI
This has an R2
=0.57 R2
-adj=0.55 AIC=298.1 (lowest) λ=0.1256
Prediction Interval: ln (PSA) = (0.352, 3.994) [PSA] – mg/ml.
Point estimate for the given data: ln(PSA) = 2.173 ln(mg/ml)
Problem-2: Parametric Completely Random Design
Steps:
i. Read the data and sort it to two columns: values and treatment. If the lengths are different, use factor
argument in data.frame.
ii. Check for distribution by qq-plot. If normal, use parametric procedures. If not, use non-parametric.
iii. Create an analysis of variance table.
iv. Check for four assumptions: normality, independence, homogeneity, fit. If follows, model is valid.
v. Compare pairwise treatment and check if ‘0’ is in interval. If yes, then means differ (statistically).
vi. The data is found to be normal. Bonferroni and Tukey’s method at 95% confidence interval is used here.
iii. Analysis of Variance Table
iv. Model Validity: Check for Assumptions
v. Pairwise Treatment
 Bonferroni (level of significance=5%)  Tukey (level of significance=5%)
From the analysis, it is observed that difference in sorption rates exist for following treatments:
 Aromatics and Esters (observed in both)
 Chloroalkanes and Esters (observed only in
Bonferroni) 𝐴 𝐶̅̅̅̅̅ 𝐸
Response: S.Rate
Df Sum Sq Mean Sq F value Pr(>F)
Solvent 2 3.3054 1.65270 24.512 5.855e-07 ***
Residuals 29 1.9553 0.06743
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[1] "95 % Pairwise CIs"
Diff. Lower Upper Differ?
A - C -0.064 -0.385 0.257 0
A - E 0.612 0.334 0.890 1
C - E 0.676 0.387 0.965 1
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = S.Rate ~ Solvent, data = io)
$Solvent
diff lwr upr p adj
C-A 0.06402778 -0.2475781 0.3756337 0.8683140
E-A -0.61222222 -0.8826095 -0.3418350 0.0000143
E-C -0.67625000 -0.9570006 -0.3954994 0.0000054

3
In Bonferroni's, Chloroalkanes and Esters are found to differ. But in Tukey’s, it is not. Since Bonferroni’s is more
conservative, we adopt it for conclusion. Unit: mole percentage
Problem-3: Nonparametric Randomized Block Design
Steps:
i. Determine the factor (treatment) and the block. Here a machine is assigned to an operator, Machine is taken
as block and Operator is taken as treatment. However to the problem requires to check for both cases.
ii. Create a dataframe and arrange the values at given levels (also known as factors).
iii. Make an anova table to find if treatments differ, atleast one.
iv. If the sample size is too small, adopt a non-parametric test. (Friedman’s test is done here)
v. Use the function friedman.test or friedman.test2.
vi. Check if zero is in the interval. If yes, then the factors do not differ.
ii. Dataframe with levels/factors: iii. Analysis of Variance Table
At level of significance=10%, the Machine and Operator treatment differs.
v. Friedman Nonparametric Test
 Observation as Factor  Machine as Factor
From the analysis, it is observed that the difference in Time exist for following treatments:
 Operator: 1-5, 1-6, 2-5, 2-6
 Machine: 1-4, 2-3, 2-4
𝑂2 𝑂1̅̅̅̅̅̅̅̅̅ 𝑂3 𝑂4̅̅̅̅̅̅̅̅ 𝑂6 𝑂5
𝑀1 𝑀2̅̅̅̅̅̅̅̅̅̅ 𝑀3̅̅̅̅ 𝑀4
[Time]: s
Operator
Machine O1 O2 O3 O4 O5 O6
M1 42.5 39.3 39.6 39.9 42.9 43.6
M2 39.8 40.1 40.5 42.3 42.5 43.1
M3 40.2 40.5 41.3 43.4 44.9 45.1
M4 41.3 42.2 43.5 44.2 45.9 42.3
Df Sum Sq Mean Sq F value Pr(>F)
Machine 3 15.92 5.308 3.339 0.04790 *
Operator 5 42.09 8.417 5.294 0.00533 **
Residuals 15 23.85 1.590
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[1] "90 % Pairwise CIs on rank sums"
Difference Lower Upper Differ?
O1 - O2 0 -9.9522 9.9522 0
O1 - O3 -5 -14.9522 4.9522 0
O1 - O4 -9 -18.9522 0.9522 0
O1 - O5 -14 -23.9522 -4.0478 1
O1 - O6 -14 -23.9522 -4.0478 1
O2 - O3 -5 -14.9522 4.9522 0
O2 - O4 -9 -18.9522 0.9522 0
O2 - O5 -14 -23.9522 -4.0478 1
O2 - O6 -14 -23.9522 -4.0478 1
O3 - O4 -4 -13.9522 5.9522 0
O3 - O5 -9 -18.9522 0.9522 0
O3 - O6 -9 -18.9522 0.9522 0
O4 - O5 -5 -14.9522 4.9522 0
O4 - O6 -5 -14.9522 4.9522 0
O5 - O6 0 -9.9522 9.9522 0
Friedman TS: 14.57143 on df= 5 with p-value= 0.01235945443
[1] "90 % Pairwise CIs on rank sums"
Difference Lower Upper Differ?
M1 - M2 2 -4.4378 8.4378 0
M1 - M3 -6 -12.4378 0.4378 0
M1 - M4 -8 -14.4378 -1.5622 1
M2 - M3 -8 -14.4378 -1.5622 1
M2 - M4 -10 -16.4378 -3.5622 1
M3 - M4 -2 -8.4378 4.4378 0
Friedman TS: 6.80 on df= 3 with p-value= 0.07855315984

4
Program in R:
#PROBLEM-1:MULTIPLE REGRESSION MODEL
alcohol=read.csv("C:/Users/Kalaivanan
Murthy/Desktop/Project
Data/P1.csv",header=T)
ls.str(alcohol)
View(alcohol)
#NAIVE MODEL
vintage=lm(PSA~Cancer.Volume+Weight+Age+BPH+
SVI+Penetration+Score,data=alcohol)
summary(vintage)
AIC(vintage)
#CORRELATION
round(cor(alcohol[c("PSA","Cancer.Volume","W
eight","Age",
"BPH","SVI","Penetration","Score")]),3)
#UPDATE MODEL
vintage.new=update(vintage,.~. -Weight-
Penetration-Age+SVI*Score)
summary(vintage.new)
AIC(vintage.new)
#CHECK ASSUMPTIONS
source("http://www.stat.ufl.edu/~athienit/ch
eck.R")
check(vintage.new,tests=T)
#TRY TRANSFORMATION
library(car)
red.wine=powerTransform(vintage.new)
summary(red.wine)
PSA.new=bcPower(alcohol$PSA,0.1118)
vintage.transform=update(vintage.new, .~. -
PSA+PSA.new)
summary(vintage.transform)
AIC(vintage.transform)
check(vintage.transform,tests=T)
#TRY HIGHER ORDER TERMS
Score.new=bcPower(alcohol$Score,2)
vintage.x=update(vintage.new, .~. -
Score+Score.new-SVI:Score+SVI:Score.new)
summary(vintage.x)
check(vintage.x,tests=T)
#PREDICTION INTERVALS
predict.lm(vintage.x,newdata=data.frame(Canc
er.Volume=4.2633,Weight=22.783,
Age=68,BPH=1.3500,SVI=0,Penetration=0,Score.
new=36),
interval="prediction",level=0.90,se.fit=F)
#PROBLEM-2:PARAMETRIC COMPLETELY RANDOM
DESIGN
jupiter.moon=read.csv("C:/Users/Kalaivanan
ls.str(jupiter.moon)
attach(jupiter.moon)
A=Aromatics[!is.na(Aromatics)];
length(Aromatics)
C=Chloroalkanes[!is.na(Chloroalkanes)];
length (Chloroalkanes)
E=Esters[!is.na(Esters)]; length(Esters)
#CHECK FOR NORMALITY
qqnorm(A,datax=T); qqline(A,datax=T)
qqnorm(C,datax=T); qqline(C,datax=T)
qqnorm(E,datax=T); qqline(E,datax=T)
#CREATE NEW DATAFRAME & ANOVA TABLE
io=data.frame(S.Rate=c(A,C,E),
Solvent=factor(rep(c("A","C","E"),
times=c(length(A),length(C),length(E)))))
callisto=aov(S.Rate~Solvent,data=io)
anova(callisto) #QUESTION:
Sou:Internet: 'Callisto' is already aov
table. why is done again?
summary(callisto)
#BONFERRONI ANALYSIS
source("http://www.stat.ufl.edu/~athienit/Bo
nf.R")
Bonf(S.Rate~Solvent,data=io)
#TUKEY ANALYSIS
ganymede=TukeyHSD(callisto)
print(ganymede)
#CHECK ASSUMPTIONS
source("http://www.stat.ufl.edu/~athienit/ch
eck.R")
check(callisto)

5
#PROBLEM-3: NONPARAMETRIC RANDOMIZED BLOCK
DESIGN
milky.way=read.csv("C:/Users/Kalaivanan
print(milky.way)
attach(milky.way)
#DATAFRAME & LEVELS
andromeda=data.frame(Time=c(O1,O2,O3,O4,O5,O
6),
Operator=factor(rep(c("O1","O2","O3","O4","O
5","O6"),each=4)),
Machine=factor(rep(c("M1","M2","M3","M4"))))
attach(andromeda)
tadpole=aov(Time~Machine+Operator)
summary(tadpole)
xtabs(Time~Machine+Operator,data=andromeda)
#RANK MATRIX
lmc.rank=tapply(Time,Machine,rank);
print(lmc.rank)
lmc.table=matrix(0,4,6,dimnames=list(c(),c("
O1","O2","O3","O4","O5","O6")))
for(i in 1:4){
lmc.table[i,]=lmc.rank[[i]]
}
print(lmc.table)
#FRIEDMAN TEST
friedman.test(Time~Operator|Machine,
data=andromeda)
#FRIEDMAN TEST-2 (DR.ATHIENITIS)
source("http://www.stat.ufl.edu/~athienit/fr
iedman.test2.R")
friedman.test2(Time~Operator|Machine,mc=TRUE
,data=andromeda,level=0.90)
friedman.test2(Time~Machine|Operator,mc=TRUE
,data=andromeda,level=0.90)

1. A university medical center urology group was interested in modeling prostate-specific
antigen (PSA) and a number of prognostic clinical measurements in men with advanced
prostate cancer. Data were collected on 97 men who were about to undergo radical
prostectomies.
http://www.stat.ufl.edu/~athienit/STA6166/assignment3_1.txt
Each line of the data set ha an identification number and provides information on 8
other variables
Variable
Number
Variable Name Description
1 ID number 1-97
2 PSA level Serum prostate-specific antigen level (mg/ml)
3 Cancer volume Estimate of prostate cancer volume (cc)
4 Weight Prostate weight (gramms)
5 Age Age of patient (years)
6 Benign prostatic
hyperplasia
Amount of benign prostatic hyperplasia (cm2
)
7 Seminal vesicle in-
vasion
Presence of seminal vesicle invasion: 1 yes; 0
otherwise
8 Capsular penetra-
tion
Degree of capsular penetration (cm)
9 Gleason score Pathologically determined grade of disease.
(Scores were either 6, 7, or 8 with higher scores
indicating worse prognosis)
Develop a “best” model for predicting PSA and interpret. In addition, create a 90%
prediction interval for PSA levels for an individual who has the following values.
Variable
Number
Variable Name Value
3 Cancer volume 4.2633
4 Weight 22.783
5 Age 68
6 Benign prostatic hyperplasia 1.3500
7 Seminal vesicle invasion 0
8 Capsular penetration 0
9 Gleason score 6
Problem Statement
6

2. Three chemical cleaning solvents are a potential hazardous waste. Independent samples
of solvents from each type were tested and their sorption rates were recorded as a mole
percentage.
Aromatics Chloroalkanes Esters
1.06 0.95 1.58 1.12 0.29 0.43 0.06
0.79 0.65 1.45 0.91 0.06 0.51 0.09
0.82 1.15 0.57 0.83 0.44 0.10 0.17
0.89 1.12 1.16 0.43 0.55 0.53 0.17
1.05 0.61 0.34 0.60
Analyze the dataset (using appropriate methodology) and determine if there are dif-
ferences in the mean sorption rate. If so, what are the differences?
3. Four different machines are being considered in an assembly plant. It is decided that
6 different operators are to be used in a randomized block experiment. The machines
are assigned in a random order to each operator. The amount of time (in seconds)
were recorded:
Operator
Machine 1 2 3 4 5 6
1 42.5 39.3 39.6 39.9 42.9 43.6
2 39.8 40.1 40.5 42.3 42.5 43.1
3 40.2 40.5 41.3 43.4 44.9 45.1
4 41.3 42.2 43.5 44.2 45.9 42.3
Do the machines perform at the same rate using α = 0.10? If not, which differ, by
how much, etc. Give a complete analysis. Does the operator play a significant effect?
Explain.
7

Parametric Regression Modeling and Nonparametric Tests

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (6)

Similar to Parametric Regression Modeling and Nonparametric Tests

Similar to Parametric Regression Modeling and Nonparametric Tests (20)

More from Kalaivanan Murthy

More from Kalaivanan Murthy (18)

Recently uploaded

Recently uploaded (20)

Parametric Regression Modeling and Nonparametric Tests