Ohio SAS® Users Conference
June 1, 2015, Cincinnati, Ohio, USA
How PROC SQL and SAS® Macro Programming Made My Statistical Analysis
Easy? A Case Study on Linear Regression
Venu Perla Ph.D., Independent SAS Programmer, Cross Lanes, WV 25313
Abstract
Life scientists collect similar types of data on a daily basis. Statistical analysis of these data is often performed
using SAS programming techniques, and programming for each dataset is time consuming. The objective of this paper is
first to show how SAS programs are created for systematic analysis of raw data to develop a linear regression model
for prediction, then to show how PROC SQL can replace several DATA steps in the code, and finally to show how SAS
macros are built from these programs and used for routine analysis of similar data in a short period of time without
any further hard coding.
Introduction
This paper uses raw data on two interrelated plant metabolites (X and Y) to generate a linear regression model. There
are 51 observations in this replicated dataset. The analysis is carried out with SAS® 9.4 software on the Windows
operating system. The code is also tested with SAS® Studio 3.1. HTML results generated by the code are presented here
with the STYLE option ‘HTMLBlue’.
Importing Data from Excel
Data used in this paper are imported from a sheet (XY_Data) of a Microsoft® Office Excel 97-2003 file (data1.xls) (see
Appendix). PROC IMPORT is used to import ‘XY_Data’ and to name the resulting dataset ‘HEALTH’ (Table 1). A macro
variable, ‘PATH’, is created for the Excel file path. In the FILE= option, a period (‘.’) is placed at the end of the
macro variable reference to mark where the macro variable name ends, so that the macro facility does not misinterpret
it. The file extension and the DBMS= option in the code may be modified according to the Excel version used.
%let path= C:\Users\Perla\Desktop\;
title "Importing data from excel";
proc import file="&path.data1.xls"
out=health replace
dbms=xls;
sheet=XY_Data;
getnames=yes;
run;
title "Checking imported data";
proc print data=health;
run;
Macro ‘EXCEL_IMPORT’ is defined below to wrap the above code for importing Excel files. Here, ‘EXCEL_FILE=’ is the
name of the Excel file to be used; ‘EXCEL_SHEET=’ is the name of the Excel sheet to be imported; and ‘DATASET=’ is the
name of the output dataset.
%macro excel_import (excel_file= , excel_sheet= , dataset= );
title "Importing data from excel";
proc import file="&path.&excel_file..xls"
out=&dataset replace
dbms=xls;
sheet=&excel_sheet;
getnames=yes;
run;
title "Dataset from imported excel data";
proc print data=&dataset;
run;
%mend excel_import;
Defined macros are saved in a single folder (STATMACROS) for future use.
C:\Users\Perla\Documents\My SAS Files\9.4\statmacros
To import XY_Data, macro EXCEL_IMPORT can be called with the following code after the location of the stored macros is
specified in a global OPTIONS statement. MPRINT, MLOGIC and SYMBOLGEN are global options for debugging the code.
options mprint mlogic symbolgen;
options mautosource sasautos=
"C:\Users\Perla\Documents\My SAS Files\9.4\statmacros";
%excel_import (excel_file=data1, excel_sheet=XY_Data, dataset=health);
Preliminary Analysis of Data
The relationship between the X- and Y-variables can be visualized using PROC SGPLOT and PROC CORR.
ods graphics on;
title "Scatter plot of X and Y";
proc sgplot data= health;
scatter x=x y=y;
run;
title "Correlation between X and Y";
proc corr data = health;
var x y;
run;
ods graphics off;
The scatter plot of X and Y shows no clear relationship between the two variables (Figure 1). The Pearson correlation
coefficient indicates a weak correlation between the X- and Y-variables (Table 2).
Macro ‘SCATTER_CORR’ is defined below for the above code. Here, ‘DATASET=’ is the name of the dataset to be used for
analysis; and ‘XVAR=’ and ‘YVAR=’ are the names of the X- and Y-variables, respectively.
%macro scatter_corr (dataset= , xvar= , yvar= );
ods graphics on;
title "Scatter plot of &xvar and &yvar";
proc sgplot data= &dataset;
scatter x=&xvar y=&yvar;
run;
title "Correlation between &xvar and &yvar";
proc corr data = &dataset;
var &xvar &yvar;
run;
ods graphics off;
%mend scatter_corr;
Macro ‘SCATTER_CORR’ can be invoked by the following statement for the dataset ‘HEALTH’:
%scatter_corr (dataset=health, xvar=x, yvar=y);
There is an indication of a weak correlation between X and Y (Pearson correlation coefficient: 0.35). Further analysis
is carried out on the raw data using PROC REG and PROC UNIVARIATE. The LACKFIT option of the MODEL statement in PROC
REG tests whether the linear model is a good fit for this replicated data. Residual analysis and normality tests are
carried out using PROC UNIVARIATE with the NORMAL option.
ODS graphics on;
title "Regression analysis";
proc reg data = health plots(only)=diagnostics (unpack);
model y = x/lackfit;
output out =mdlres r=resid;
run;
ODS graphics off;
proc univariate data= mdlres normal;
var resid;
run;
Analysis of variance indicates that the lack of fit for the linear model is significant (Table 3). This suggests that
further in-depth analysis has to be carried out on the raw data before rejecting the model.
Parameter estimates and the adjusted R² value for the raw data are provided in Tables 4A and 4B, respectively. The
adjusted R² value is low (0.11).
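As a rough cross-check (a sketch, not part of the original analysis), the adjusted R² of a model with p predictors
fitted to n observations follows from the ordinary R². Assuming n = 51 and p = 1 as stated in this paper, and taking
the squared Pearson coefficient (0.35² ≈ 0.12) as the R² of the one-predictor model:

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
          \approx 1 - (1 - 0.1225)\,\frac{50}{49}
          \approx 0.10
```

This is in line with the reported 0.11 once rounding of the correlation coefficient is taken into account.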
The distribution of residuals for Y is not normal for the raw data (Figure 2).
Furthermore, the observed-by-predicted plot for Y shows that all the observations are crowded in the lower left corner
of the plot (Figure 3).
The Q-Q plot of residuals for Y further confirms that the raw data are not normal (Figure 4).
The raw data are skewed (Table 5), and the significant p-values for all four tests of normality confirm the non-normal
distribution of the data (Table 6).
Macro ‘REG_NORMALITY’ is defined below for the regression analysis and normality tests described above. Here,
‘DATASET=’ is the name of the dataset to be used for analysis; and ‘XVAR=’ and ‘YVAR=’ are the names of the X- and
Y-variables, respectively.
%macro reg_normality (dataset= ,xvar= ,yvar= );
ODS graphics on;
title "Regression analysis: Dataset &dataset";
proc reg data = &dataset plots(only)=diagnostics (unpack);
model &yvar = &xvar/lackfit;
output out =mdlres r=resid;
run;
proc univariate data= mdlres normal;
var resid;
run;
ODS graphics off;
%mend reg_normality;
Macro ‘REG_NORMALITY’ can be called by the following statement for the dataset ‘HEALTH’:
%reg_normality (dataset=health, xvar=x, yvar=y);
Preliminary Data Transformation
Box-Cox power transformation can be adopted to normalize the raw data. Because the Box-Cox transformation requires
strictly positive values, the data should be converted to non-zero and non-negative values before it is applied. The
following code shifts the X- and Y-variables only when zero or negative values are encountered in the data.
PROC SQL is used to transform the X- and Y-variable data into non-zero and non-negative data. Table HEALTH_COX is
created from dataset HEALTH in this procedure. PROC SQL reproduced the original data because there are no zero or
negative values (Table 7; Log 1).
title "Transforming X and Y values into non-zero and non-negative values";
proc sql;
create table health_cox as
select case
when min(x) <=0 then (-(min(x))+x+1)
else x
end as X,
case
when min(y) <=0 then (-(min(y))+y+1)
else y
end as Y
from health;
quit;
proc print data=health_cox;
run;
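Before relying on the shift, it can help to confirm whether it was triggered at all. The following is a minimal
sketch, assuming the dataset HEALTH and the variables X and Y used above; the column aliases MIN_X and MIN_Y are
illustrative names that do not appear in the original code.

```sas
/* Sketch: report the minimum of each variable.                */
/* If both minima are > 0, the CASE expressions above leave    */
/* X and Y unchanged; otherwise each value is shifted by       */
/* (1 - minimum), so that the smallest value becomes 1.        */
proc sql;
  select min(x) as min_x,
         min(y) as min_y
  from health;
quit;
```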
Macro ‘TRANSFORM_ZERO_NEG’ is defined below for the above PROC SQL code. Here, ‘DATASET=’ is the name of the input
dataset to be used for transforming X- and Y-values; ‘XVAR=’ and ‘YVAR=’ are the names of the X- and Y-variables to be
transformed, respectively; and ‘PRE_TRANS_DATASET=’ is the name of the output dataset to be created with transformed
X- and Y-variables.
%macro transform_zero_neg (dataset= ,xvar= ,yvar= ,pre_trans_dataset=);
title "Transforming &xvar and &yvar values into non-zero and non-negative values";
proc sql;
create table &pre_trans_dataset as
select case
when min(&xvar) <=0 then (-(min(&xvar))+&xvar+1)
else &xvar
end as &xvar,
case
when min(&yvar) <=0 then (-(min(&yvar))+&yvar+1)
else &yvar
end as &yvar
from &dataset;
quit;
proc print data=&pre_trans_dataset;
run;
%mend transform_zero_neg;
Macro ‘TRANSFORM_ZERO_NEG’ can be invoked by the following statement for the dataset ‘HEALTH’:
%transform_zero_neg
(dataset=health,xvar=x,yvar=y,pre_trans_dataset=health_cox);
Box-Cox Power Transformation
Box-Cox power transformation of the non-zero and non-negative data is performed using PROC TRANSREG with ODS GRAPHICS
turned on.
title "Box-Cox power transformation: Identification of right exponent (Lambda)";
ods graphics on;
proc transreg data= health_cox;
model boxcox(y) = identity(x);
run;
ods graphics off;
The above code generated the Box-Cox analysis for Y (Figure 6). The selected lambda (-0.75, reported with its 95%
confidence interval) is the exponent to be used to transform the data toward a normal shape.
To obtain the convenient lambda value, the above SAS code is executed without the ODS GRAPHICS statement.
proc transreg data = health_cox;
model boxcox(y)=identity(x);
run;
This code generated the best lambda, the 95% confidence interval for lambda, and the convenient lambda (Table 8). The
convenient lambda is used for transforming the Y-variable in this analysis.
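For reference, the Box-Cox family over which lambda is searched is conventionally written as shown below. This is the
standard textbook definition, given here as background rather than quoted from the paper:

```latex
y^{(\lambda)} =
\begin{cases}
  \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
  \log y,                           & \lambda = 0
\end{cases}
\qquad y > 0
```

For a convenient lambda of -1 this gives 1 - 1/Y, which differs from the simple reciprocal 1/Y only by a linear
rescaling. Such a rescaling changes the sign and intercept of a fitted line but not the quality of the fit.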
Macro ‘BOX_COX_LAMBDA’ is defined below for the above code. Here, ‘PRE_TRANS_DATASET=’ is the name of the input
dataset with non-zero and non-negative values; and ‘XVAR=’ and ‘YVAR=’ are the names of the X- and Y-variables,
respectively.
%macro box_cox_lambda (pre_trans_dataset= ,xvar= ,yvar= );
title "Box-Cox power transformation: Identification of right exponent (Lambda)";
ods graphics on;
proc transreg data= &pre_trans_dataset;
model boxcox(&yvar) = identity(&xvar);
run;
ods graphics off;
proc transreg data = &pre_trans_dataset;
model boxcox(&yvar)=identity(&xvar);
run;
%mend box_cox_lambda;
Macro ‘BOX_COX_LAMBDA’ can be called by the following statement for the dataset ‘HEALTH_COX’:
%box_cox_lambda (pre_trans_dataset=health_cox, xvar=x ,yvar=y);
A DATA step program is used to transform the Y-variable. Code for the common convenient lambda values (-2, -1, -0.5,
0, 0.5, 1 and 2); the respective Y-transformations (1/Y², 1/Y, 1/sqrt(Y), log(Y), sqrt(Y), Y and Y²); and the
respective transformed-Y variable names (neg_2_y, neg_1_y, neg_half_y, zero_y, half_y, one_y, and two_y) are
incorporated in the following DATA step program.
title "Transformation of Y-variable with convenient lambda";
data health_trans_1;
set health_cox;
neg_2_y = 1/(y**2);
neg_1_y = 1/(y**1);
neg_half_y = 1/(sqrt(y));
zero_y = log(y);
half_y = sqrt(y);
one_y = y**1;
two_y = y**2;
run;
proc print data=health_trans_1;
run;
This code generated dataset ‘HEALTH_TRANS_1’ with the new transformed Y-variables (Table 9A).
Alternatively, the following PROC SQL code generates the same dataset under a different name.
title "Transformation of Y-values with convenient lambda";
proc sql;
create table health_trans as
select x, y,
1/(y**2) as neg_2_y,
1/(y**1) as neg_1_y,
1/(sqrt(y)) as neg_half_y,
log(y) as zero_y,
sqrt(y) as half_y,
y**1 as one_y,
y**2 as two_y
from health_cox;
quit;
proc print data=health_trans;
run;
PROC SQL generated the ‘HEALTH_TRANS’ table (Table 9B). ‘neg_1_y’ is the transformed Y-variable corresponding to the
convenient lambda -1. This ‘neg_1_y’ variable is used for further analysis.
The datasets obtained with the DATA step program and the PROC SQL code are compared with PROC COMPARE. The two
datasets are equal in all respects (Log 2).
title "Comparison of output obtained by DATA step and PROC SQL methods";
proc compare
base=health_trans
compare=health_trans_1
printall;
run;
Macro ‘TRANSFORM_LAMBDA’ is defined below for the above PROC SQL code. Here, ‘PRE_TRANS_DATASET=’ is the name of the
input dataset with non-zero and non-negative X- and Y-values; ‘XVAR=’ and ‘YVAR=’ are the names of the X- and
Y-variables, respectively; and ‘TRANS_DATASET=’ is the name of the output dataset with transformed data.
%macro transform_lambda (pre_trans_dataset= ,xvar= ,yvar= ,trans_dataset= );
title "Transformation of &yvar.-values with convenient lambda";
proc sql;
create table &trans_dataset as
select &xvar, &yvar,
1/(&yvar**2) as neg_2_&yvar,
1/(&yvar**1) as neg_1_&yvar,
1/(sqrt(&yvar)) as neg_half_&yvar,
log(&yvar) as zero_&yvar,
sqrt(&yvar) as half_&yvar,
&yvar**1 as one_&yvar,
&yvar**2 as two_&yvar
from &pre_trans_dataset;
quit;
proc print data=&trans_dataset;
run;
%mend transform_lambda;
Macro ‘TRANSFORM_LAMBDA’ can be invoked by the following code for the dataset ‘HEALTH_COX’:
%transform_lambda (pre_trans_dataset=health_cox, xvar=x, yvar=y,
trans_dataset=health_trans);
Standardization of X-variable
After transformation of the Y-variable, the X-variable is standardized using PROC STDIZE in order to obtain a
meaningful Y-intercept. With METHOD=MEAN, the mean is subtracted from each X-value, so the intercept corresponds to
the mean of X. Dataset ‘HEALTH2’ is generated from table ‘HEALTH_TRANS’ in this procedure. The OPREFIX= option is used
to prefix the original X-variable name with ‘Unstdized_’, while the standardized X-values are stored under X.
title "Standardized X-variable after Y-transformation";
proc stdize data=health_trans
oprefix=Unstdized_
method=mean
out=health2;
var x;
run;
proc print data=health2;
run;
The generated dataset ‘HEALTH2’ is shown below with the standardized X-variable in the last column as X (Table 10).
Macro ‘STDIZE_X’ is defined below for the above code. Here, ‘TRANS_DATASET=’ is the name of the input dataset;
‘TRANS_STDIZE_DATASET=’ is the name of the output dataset; and ‘XVAR=’ is the name of the X-variable to be
standardized.
%macro stdize_x (trans_dataset= ,trans_stdize_dataset= ,xvar= );
title "Standardized &xvar.-variable after Y-transformation";
proc stdize data=&trans_dataset
oprefix=Unstdized_
method=mean
out=&trans_stdize_dataset;
var &xvar;
run;
proc print data=&trans_stdize_dataset;
run;
%mend stdize_x;
Macro ‘STDIZE_X’ can be called by the following code for the dataset ‘HEALTH_TRANS’:
%stdize_x (trans_dataset=health_trans, trans_stdize_dataset=health2, xvar=x);
Regression Analysis of Transformed-Standardized Data
Regression analysis and normality tests are performed on the transformed and standardized dataset ‘HEALTH2’ by calling
the previously defined macro ‘REG_NORMALITY’. Variable X is the standardized X, and ‘neg_1_y’ is the transformed Y.
%reg_normality (dataset=health2, xvar=x, yvar=neg_1_y);
With transformed data, the lack of fit for the linear model turned out to be non-significant, which indicates that the
linear model is acceptable for X and Y (Table 11). Parameter estimates for the intercept and X are significant (Table
12A). Compared to the raw data, the adjusted R² value with transformed data improved from 0.11 to 0.53 (Table 12B).
Other results indicate that the transformed data are closer to a normal distribution (Figures 6-8; Table 13). The
non-significant p-value of the Kolmogorov-Smirnov normality test supports this (Table 14). However, the other tests of
normality are still significant, which indicates that there is room for further improvement with respect to normal
distribution. There is at least one outlier and leverage observation that is influencing the normal distribution
(Figure 9).
Outlier and Influential Observations
The R and INFLUENCE options of the MODEL statement, together with the RSTUDENTBYLEVERAGE, DFFITS, DFBETAS and COOKSD
plot requests, are used in PROC REG to identify outlier and/or influential observations in the dataset ‘HEALTH2’
(Table 15; Figures 10-13).
ods graphics on;
title "Outlier or Influential observations";
proc reg data= health2 plots (only label)= (rstudentbyleverage dffits dfbetas
cooksd);
model neg_1_y = x/r influence;
run;
ods graphics off;
The highest number of asterisks is seen for observation number 43 (Table 15). This observation turned out to be an
outlier and leverage observation (Figure 11). Other results also support that observation number 43 is an outlier and
influential observation in the dataset ‘HEALTH2’ (Figures 10-13).
Macro ‘OUTLIER_OBS’ is defined below for the above code. Here, ‘INDATA=’ is the name of the input dataset; and ‘XVAR=’
and ‘YVAR=’ are the names of the X- and Y-variables to be used in the analysis, respectively.
%macro outlier_obs (indata= ,xvar= ,yvar= );
ods graphics on;
title "Outlier or influential observations: Dataset &indata";
proc reg data= &indata plots (only label)= (rstudentbyleverage dffits
dfbetas cooksd);
model &yvar = &xvar/r influence;
run;
ods graphics off;
%mend outlier_obs;
Macro ‘OUTLIER_OBS’ can be invoked by the following statement for the dataset ‘HEALTH2’:
%outlier_obs (indata=health2, xvar=x, yvar=neg_1_y);
Slicing One Outlier Observation
Dataset ‘SLICED’, containing outlier and leverage observation number 43, is created from the dataset ‘HEALTH2’ using
the following DATA step code. SAS-supplied observation numbers are used in this code to identify the outlier(s) and
write them to a separate dataset. Alternatively, a WHERE statement can be used in PROC REG to omit observations while
performing the regression analysis.
title "Dataset for outlier observation(s): sliced";
data sliced;
do slice=43;
set health2 point=slice;
output;
end;
stop;
run;
proc print data=sliced;
run;
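The WHERE-statement alternative mentioned above can be sketched as follows. Because a WHERE statement filters on
variable values rather than on observation numbers, this sketch first stores the observation number in a variable.
The dataset name HEALTH2_ID and the variable name OBS_ID are hypothetical and do not appear in the original code.

```sas
/* Sketch: keep the observation number as an explicit variable */
data health2_id;
  set health2;
  obs_id = _n_;          /* SAS-supplied observation number */
run;

/* Fit the regression while omitting observation 43 */
proc reg data=health2_id;
  where obs_id ne 43;
  model neg_1_y = x;
run;
quit;
```

This leaves ‘HEALTH2’ untouched, at the cost of repeating the WHERE condition in every later procedure call.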
Dataset ‘SLICED’ with one outlier observation is shown below (Table 16):
Macro ‘SLICE_OBS’ is defined below for the above code. Here, ‘INDATA=’ is the name of the input dataset; ‘OB=’ is the
outlier observation number; and ‘SLICED_DATA=’ is the name of the output dataset for storing only the outlier
observation.
%macro slice_obs (indata= ,ob=0 ,sliced_data= );
title "Dataset for outlier observation(s): &sliced_data";
data &sliced_data;
do slice=&ob;
set &indata point=slice;
output;
end;
stop;
run;
proc print data=&sliced_data;
run;
%mend slice_obs;
Macro ‘SLICE_OBS’ can be called by the following statement for outlier observation number 43 of the dataset ‘HEALTH2’:
%slice_obs (indata=health2, ob=43, sliced_data=sliced);
When no observation number is available yet, use OB=0 (the macro default). This produces missing values in place of an
outlier observation, and further analysis is not affected by these missing values.
%slice_obs (indata=health2, ob=0, sliced_data=sliced);
Dataset without One Outlier Observation
The following DATA step program generates dataset ‘HEALTH3_1’ with all the observations of the dataset ‘HEALTH2’
except the one that matches the outlier observation in the dataset ‘SLICED’ (Table 17A). Here, the data must be sorted
before merging. Note that the output dataset ‘HEALTH3_1’ contains only 50 observations. The total real and CPU times
required for the PROC SORT and DATA steps are 0.07 and 0.06 seconds, respectively (Log 3A).
title "Sorting datasets before merging";
proc sort data=health2;
by unstdized_x y;
run;
title "Dataset without outlier observation(s)";
Data health3_1;
merge health2 (in= inhealth)
sliced (in=insliced);
by unstdized_x y;
if inhealth ^= insliced;
run;
proc print data=health3_1;
run;
Alternatively, the following PROC SQL code can be used to produce the same output in the form of table ‘HEALTH3’
(Table 17B). Unlike the DATA step program, PROC SQL can combine the datasets without prior sorting. The real and CPU
times required for PROC SQL are 0.01 and 0.03 seconds, respectively (Log 3B). In other words, the PROC SQL code is
shorter and quicker than the DATA step program in this example.
title "Dataset without outlier observation(s)";
proc sql;
create table health3 as
select* from health2
except all
select* from sliced;
quit;
proc print data= health3;
run;
The outputs of the DATA step program and PROC SQL are compared and verified with PROC COMPARE. All the values in the
datasets ‘HEALTH3_1’ and ‘HEALTH3’ are exactly equal (Log 3C).
title "Comparison of datasets: Data step program vs PROC SQL";
proc compare
base=health3
compare=health3_1
printall;
run;
Macro ‘NO_OUTLIER_DATA’ is defined below for the above PROC SQL code. Here, ‘INDATA=’ is the name of the dataset with
all the observations; ‘SLICED_DATA=’ is the name of the dataset with only outlier observations; and ‘OUTDATA=’ is the
name of the output dataset with all the observations except outliers.
%macro no_outlier_data (indata= ,sliced_data= ,outdata=);
title "&outdata.: Dataset without outlier observation(s)";
proc sql;
create table &outdata as
select* from &indata
except all
select* from &sliced_data;
quit;
proc print data= &outdata;
run;
%mend no_outlier_data;
Macro ‘NO_OUTLIER_DATA’ can be invoked by the following statement for the datasets ‘HEALTH2’ and ‘SLICED’:
%no_outlier_data (indata=health2, sliced_data=sliced, outdata=health3);
Regression Analysis without One Outlier Observation
Regression analysis and normality tests are performed on dataset ‘HEALTH3’ by invoking the previously defined macro
‘REG_NORMALITY’. Dataset ‘HEALTH3’ is devoid of one outlier observation. X is the standardized X-variable, and
‘neg_1_y’ is the transformed Y-variable.
%reg_normality (dataset=health3, xvar=x, yvar=neg_1_y);
Since the lack of fit is non-significant, the linear model for X and Y can be accepted for prediction (Table 18). In
the absence of one outlier observation, the adjusted R² value increased from 0.53 in the previous regression analysis
to 0.61 (Table 19B). Parameter estimates are also modified (Table 19A). Results suggest that the data are normal
(Figures 14 and 15; Table 20). Further improvement may be possible, as the Shapiro-Wilk test, one of the four tests of
normality, is still significant. Care should be taken to avoid eliminating too many observations while improving the
normal distribution of the data.
Regression Analysis without Second Outlier Observation
Previously defined macros come in handy while performing this task. No more hard coding is required for this task.
Following macros are invoked to complete this task.
%outlier_obs (indata=health3, xvar=x, yvar=neg_1_y);
%slice_obs (indata=health3, ob=10,sliced_data=sliced2);
%no_outlier_data (indata=health3, sliced_data=sliced2, outdata=health4);
%reg_normality (dataset=health4, xvar=x, yvar=neg_1_y);
Outlier and influential observations are produced by calling macro ‘OUTLIER_OBS’. Observation number 10, which has the
highest number of asterisks in dataset ‘HEALTH3’, is identified as an outlier (Table 21; Figure 16).
Macro ‘SLICE_OBS’ stored one outlier observation (obs # 10) in a dataset named ‘SLICED2’ (Table 22).
Table ‘HEALTH4’, without the second outlier observation, is generated by calling macro ‘NO_OUTLIER_DATA’ (Table 23).
Note that the number of observations in the dataset ‘HEALTH4’ is reduced to 49 (Table 23; Log 4).
Regression analysis and normality tests are run on dataset ‘HEALTH4’ by macro ‘REG_NORMALITY’. As in the earlier
results, further improvement was achieved in the parameter estimates and the adjusted R² (0.70). The data are closer
to normal than in the previous analysis (Tables 24, 25A and 25B; Figures 17 and 18). However, the Shapiro-Wilk
normality test is still significant (Table 26).
Regression Analysis without Third Outlier Observation
For the sake of exploration, further analysis is carried out to eliminate a third outlier observation from the data.
Interestingly, invoking macro ‘OUTLIER_OBS’ on the ‘HEALTH4’ dataset produced two competing outlier observations
(obs # 47 and obs # 28) (Table 27; Figures 19-22). For this reason, further analysis with the other macros (SLICE_OBS,
NO_OUTLIER_DATA, and REG_NORMALITY) is carried out separately for observation numbers 47 and 28.
%outlier_obs (indata=health4, xvar=x, yvar=neg_1_y);
Analysis without outlier observation number 47:
%slice_obs (indata=health4, ob=47,sliced_data=sliced3);
%no_outlier_data (indata=health4, sliced_data=sliced3, outdata=health5);
%reg_normality (dataset=health5, xvar=x, yvar=neg_1_y);
Elimination of observation number 47 did not improve the status of normality tests (Table 28). Shapiro-Wilk test is still
significant.
Analysis without outlier observation number 28:
%slice_obs (indata=health4, ob=28,sliced_data=sliced3);
%no_outlier_data (indata=health4, sliced_data=sliced3, outdata=health5);
%reg_normality (dataset=health5, xvar=x, yvar=neg_1_y);
Dataset ‘SLICED3’ for outlier observation number 28 is generated by invoking macro ‘SLICE_OBS’ (Log 5). Note that
dataset ‘HEALTH5’ contains only 48 observations (Log 6). After eliminating observation number 28, all four tests of
normality are non-significant (Table 29). Other results support this finding (Figures 23 and 24). The residual-fit
spread plot indicates how much of the variation in the model is accounted for by the X-variable (Figure 25). Data
pertaining to analysis of variance, parameter estimates and adjusted R² are presented in Tables 30, 31A and 31B,
respectively. As in the previous analysis, the lack of fit for the model is non-significant (Table 30). The adjusted
R² value is further improved to 0.73.
Linear Regression Models
From the above analysis, linear regression models for the raw data, the normalized data, and the normalized data
without 1 to 3 outlier observations are given below. After normalization, the data began to exhibit a clearer
relationship between X and Y. One should consider several other factors before proceeding to eliminate outlier
observations.
Raw data (non-normal): Y = 0.124 + 0.933X (Adjusted R²: 0.11)
Normalized data: Y = 1.240 - 0.514X (Adjusted R²: 0.53)
Normalized data without 1 outlier observation: Y = 1.213 - 0.653X (Adjusted R²: 0.61)
Normalized data without 2 outlier observations: Y = 1.234 - 0.689X (Adjusted R²: 0.70)
Normalized data without 3 outlier observations: Y = 1.218 - 0.690X (Adjusted R²: 0.73)
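The regressions on transformed data were fitted with ‘neg_1_y’ (that is, 1/Y) as the response and the mean-centered X
as the predictor. Under that reading, which is an interpretation of the code rather than a statement made in the text,
Y in the normalized models stands for 1/Y and X for X minus its mean, so a prediction on the original scale would be
obtained by inverting the fitted value. Here b_0 and b_1 denote the fitted intercept and slope, and \bar{X} denotes the
mean of the unstandardized X:

```latex
\widehat{\left(\frac{1}{Y}\right)} = b_0 + b_1\,(X - \bar{X})
\quad\Longrightarrow\quad
\hat{Y} \approx \frac{1}{b_0 + b_1\,(X - \bar{X})}
```

For the final model listed above, b_0 = 1.218 and b_1 = -0.690. The inverted value is only an approximate prediction
of Y, since inverting a fitted mean of 1/Y does not give the mean of Y exactly.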
Scatter Plots after Normalization
Optionally, the relationship between X and Y can be visualized by calling macro ‘SCATTER_CORR’ again for the
transformed dataset ‘HEALTH2’ and the final dataset ‘HEALTH5’.
Analysis of dataset ‘HEALTH2’:
%scatter_corr (dataset=health2, xvar=x, yvar=neg_1_y);
Analysis of dataset ‘HEALTH5’:
%scatter_corr (dataset=health5, xvar=x, yvar=neg_1_y);
The Pearson correlation coefficient of the non-normal raw data is 0.35 (Table 2). Unlike the raw data, both datasets,
‘HEALTH2’ and ‘HEALTH5’, are approximately normal and exhibit a similar, strong negative relationship between X and Y,
with Pearson correlation coefficients between -0.70 and -0.90 (Figures 26 and 27; Tables 32 and 33).
Master Macros
Three master macros are created for this whole analysis. Upon invoking, these master macros call other macros that
are previously defined.
Master macro 1: IMP_SCATT_CORR_REG_NORMAL
This macro is for the initial set of operations (data import from an Excel file, scatter plot, correlation, regression
analysis and normality tests). There are three macros (EXCEL_IMPORT, SCATTER_CORR and REG_NORMALITY) within this
master macro. All the keyword parameters are described above for each macro. The code for macro EXCEL_IMPORT may be
modified according to the version of the Excel file.
%macro imp_scatt_corr_reg_normal (excel_file= ,excel_sheet= , dataset= ,xvar= ,
yvar=);
%excel_import (excel_file=&excel_file, excel_sheet=&excel_sheet,
dataset=&dataset);
%scatter_corr (dataset=&dataset, xvar=&xvar, yvar=&yvar);
%reg_normality (dataset=&dataset, xvar=&xvar, yvar=&yvar);
%mend imp_scatt_corr_reg_normal;
Master macro 2: TRANSFORMATION_BOX_COX
This macro is for transformation of data if it is not normal (conditions apply). There are four macros
(TRANSFORM_ZERO_NEG, BOX_COX_LAMBDA, TRANSFORM_LAMBDA and STDIZE_X) within this master
macro. All the keyword parameters are described above for each macro.
%macro transformation_box_cox (dataset= , pre_trans_dataset= , trans_dataset= ,
trans_stdize_dataset= , xvar= , yvar=);
%transform_zero_neg (dataset=&dataset, xvar=&xvar, yvar=&yvar,
pre_trans_dataset=&pre_trans_dataset);
%box_cox_lambda (pre_trans_dataset=&pre_trans_dataset, xvar=&xvar,
yvar=&yvar);
%transform_lambda (pre_trans_dataset=&pre_trans_dataset, xvar=&xvar,
yvar=&yvar, trans_dataset=&trans_dataset);
%stdize_x (trans_dataset=&trans_dataset,
trans_stdize_dataset=&trans_stdize_dataset, xvar=&xvar);
%mend transformation_box_cox;
Master macro 3: REGRESSION_WOUT_OUTLIERS
This macro is for identification and elimination of outlier observations in the data. It utilizes outlier free data for
regression analysis. There are four macros (REG_NORMALITY, OUTLIER_OBS, SLICE_OBS and
NO_OUTLIER_DATA) within this master macro. All the keyword parameters are described above for each macro.
%macro regression_wout_outliers (dataset= , indata= , ob= ,sliced_data= ,
outdata= ,xvar= , yvar=);
%reg_normality (dataset=&dataset, xvar=&xvar, yvar=&yvar);
%outlier_obs (indata=&indata, xvar=&xvar, yvar=&yvar);
%slice_obs (indata=&indata, ob=&ob, sliced_data=&sliced_data);
%no_outlier_data (indata=&indata, sliced_data=&sliced_data,
outdata=&outdata);
%mend regression_wout_outliers;
Now, the whole analysis can be performed on the same (or a similar) type of data in a short period of time, without
any further hard coding, by calling the master macros in the following manner. It is important to specify the location
of the stored macros before calling them.
options mautosource sasautos="C:\Users\Perla\Documents\My SAS Files\9.4\statmacros";
%imp_scatt_corr_reg_normal (excel_file=data1,excel_sheet=XY_Data,
dataset=health,xvar=x, yvar=y);
%transformation_box_cox (dataset=health, pre_trans_dataset=health_cox,
trans_dataset=health_trans, trans_stdize_dataset=health2,xvar=x, yvar=y);
For the third master macro (REGRESSION_WOUT_OUTLIERS), start with a dataset that contains the transformed Y- and
standardized X-variables. Run this macro first with OB=0, then with OB= set to the observation number to be deleted.
In the first run, identify the outlier observation number. In the second run, slice this observation from the data.
Repeat these two steps, with caution, until the desired results are achieved.
%regression_wout_outliers (dataset=health2, indata=health2, ob=0,
sliced_data=sliced, outdata=health3, xvar=x , yvar=neg_1_y);
**From this run, it is clear that ob=43 is an outlier;
**For improvement, slice ob=43 from health2;
%regression_wout_outliers (dataset=health2, indata=health2, ob=43,
sliced_data=sliced, outdata=health3, xvar=x , yvar=neg_1_y);
%regression_wout_outliers (dataset=health3, indata=health3, ob=0,
sliced_data=sliced1, outdata=health4, xvar=x , yvar=neg_1_y);
**From this run, it is clear that ob=10 is an outlier;
**For improvement, slice ob=10 from health3;
%regression_wout_outliers (dataset=health3, indata=health3, ob=10,
sliced_data=sliced1, outdata=health4, xvar=x , yvar=neg_1_y);
%regression_wout_outliers (dataset=health4, indata=health4, ob=0,
sliced_data=sliced2, outdata=health5, xvar=x , yvar=neg_1_y);
**From this run, it is clear that ob=28 is an outlier;
**For improvement, slice ob=28 from health4;
%regression_wout_outliers (dataset=health4, indata=health4, ob=28,
sliced_data=sliced2, outdata=health5, xvar=x , yvar=neg_1_y);
**When satisfied with the results, run once more and use the model parameters from this run;
%regression_wout_outliers (dataset=health5, indata=health5, ob=0,
sliced_data=sliced3, outdata=health6, xvar=x , yvar=neg_1_y);
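The two-pass pattern above (inspect with OB=0, then rerun with the flagged observation number) relies on removing observations by position. The sketch below shows one way such a removal could be written. The macro name DROP_OBS and the dataset DEMO are hypothetical and used only for illustration; the paper's own SLICE_OBS and NO_OUTLIER_DATA macros, defined earlier, may work differently. The five demo rows are the first five rows of Table 1 in the Appendix.

```sas
/* Hypothetical helper: copy INDATA to OUTDATA without the observations */
/* whose numbers are listed in OB (space-separated; 0 removes nothing). */
%macro drop_obs (indata= , ob=0 , outdata= );
  data &outdata;
    set &indata;
    /* _N_ equals the observation number here because every row of */
    /* INDATA is read in order, one per DATA step iteration        */
    if _n_ in (&ob) then delete;
  run;
%mend drop_obs;

/* Small demo: the first five (X, Y) pairs from Table 1 */
data demo;
  input x y;
  datalines;
0.4 0.4
0.6 0.5
2.2 15.3
0.4 0.7
0.1 0.5
;
run;

/* Remove observation 3; DEMO_SLICED keeps the other four rows */
%drop_obs (indata=demo, ob=3, outdata=demo_sliced);

proc print data=demo_sliced;
run;
```

With the datasets used in this paper, the equivalent call would be %drop_obs (indata=health2, ob=43, outdata=health3). Observation numbers shift after each removal, so the next outlier has to be identified on the newly created dataset, as in the OB=0 runs above.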
Optionally, after the above analysis, the relationship between X and Y can be visualized by calling the macro
‘SCATTER_CORR’ again for the transformed datasets (HEALTH2 and HEALTH5).
%scatter_corr (dataset=health2, xvar=x, yvar=neg_1_y);
%scatter_corr (dataset=health5, xvar=x, yvar=neg_1_y);
Conclusion
In this paper, a simple linear regression model is developed for the X- and Y-variables after normalizing replicated
raw data in a systematic manner. Various statistical methods, SAS DATA step programs and PROC SQL steps are
employed to achieve this goal. PROC SQL is used in place of several DATA step programs. With the SAS macro
language, the number of SAS statements required to perform each repeatable task is reduced to a bare minimum.
Furthermore, the defined macros can be reused to analyze similar data with little hard coding and in a short
period of time.
References
Box, G. E. P. and Cox, D. R. 1964. An analysis of transformations (with discussion), Journal of the Royal Statistical
Society, Series B 26 (2): 211–252.
Buthmann, A. Making Data Normal Using Box-Cox Power Transformation, iSix Sigma. Available at
http://www.isixsigma.com/tools-templates/normality/making-data-normal-using-box-cox-power-transformation/
Carpenter, Art. 2004. Carpenter’s Complete Guide to the SAS® Macro Language, Second Edition, SAS® Institute
Inc., Cary, NC, USA.
Lafler, Kirk Paul. 2013. PROC SQL: Beyond the Basics Using SAS®, Second Edition, SAS® Institute Inc., Cary, NC,
USA.
SAS® 9.4 Product Documentation, SAS Institute Inc., Cary, NC, USA. Available at
http://support.sas.com/documentation/94/index.html
SAS/STAT® 9.3 User's Guide, SAS Institute Inc., Cary, NC, USA. Available at
http://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#intro_toc.htm
SAS® 9.2 Macro Language: Reference, SAS Institute Inc., Cary, NC, USA. Available at
http://support.sas.com/documentation/cdl/en/mcrolref/61885/HTML/default/viewer.htm#titlepage.htm
SAS® 9.3 SQL Procedure User’s Guide, SAS Institute Inc., Cary, NC, USA. Available at
http://support.sas.com/documentation/cdl/en/sqlproc/63043/HTML/default/viewer.htm#titlepage.htm
Acknowledgments
I would like to thank the organizers for giving me an opportunity to present this paper at the Ohio SAS® Users
Conference, hosted by CinSUG, CoSUG and CleveSUG on June 1, 2015, at the Kingsgate Marriott Conference Center
at the University of Cincinnati, Cincinnati, Ohio.
Trademark Citations
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Author Biography
Venu Perla, Ph.D. is a biomedical researcher with about 14 years of research and teaching
experience in an academic environment. He is currently working in West Virginia. He has served at
Purdue University, Oregon Health & Science University, Colorado State University, Kerala
Agricultural University (India) and Mangalayatan University (India) in different capacities. Dr.
Perla has published 13 peer-reviewed research papers and 2 book chapters, obtained 1
international patent (on an orthopaedic implant device), given 7 talks and presented 18 posters at
national and international scientific conferences in his professional career. He was trained in
clinical trials and clinical data management. He was also trained in advanced SAS® programming
and clinical biostatistics at the University of California, San Diego. Currently, he is actively employing SAS®
programming techniques in his research data analysis.
Contact Information
Phone (Cell): (304) 545-5705
Email: venuperla@yahoo.com
LinkedIn: https://www.linkedin.com/pub/venu-perla/2a/700/468
Appendix
Table 1. XY_Data sheet of data1.xls (Microsoft Excel 97-2003 file).
X Y
0.4 0.4
0.6 0.5
2.2 15.3
0.4 0.7
0.1 0.5
0.7 0.6
2.5 1.1
0.4 0.5
0.5 0.6
1.3 0.9
0.4 0.4
1.8 1.6
0.5 1.8
0.5 0.5
0.7 0.7
0.3 0.7
1.4 0.9
0.8 0.6
1.3 1
0.6 0.6
1.2 1
2 2.1
0.7 0.6
1.3 1.1
1.1 1
2 1.3
0.6 0.7
2.1 1.7
1.8 1.4
1.2 0.8
1 0.7
2.1 1.5
1.4 1
0.7 0.8
0.5 0.5
0.9 0.7
1.2 0.5
1.1 0.7
2.5 2
1 0.7
0.9 0.8
3 2.7
4.2 1.5
0.9 1
1.9 1.6
1 0.8
1.2 0.7
0.8 0.7
1.4 0.8
1.4 1.4
1 1
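For readers without the Excel file, the table above can also be read directly with a DATA step. This is a sketch rather than the paper's method (the paper uses PROC IMPORT). The values are transcribed from Table 1 in the same order, five (X, Y) pairs per line, and the dataset name HEALTH follows the paper.

```sas
/* Sketch: create HEALTH from Table 1 without PROC IMPORT.           */
/* The trailing @@ holds the input line so that several (X, Y) pairs */
/* can be read from each line of data.                               */
data health;
  input x y @@;
  datalines;
0.4 0.4  0.6 0.5  2.2 15.3  0.4 0.7  0.1 0.5
0.7 0.6  2.5 1.1  0.4 0.5  0.5 0.6  1.3 0.9
0.4 0.4  1.8 1.6  0.5 1.8  0.5 0.5  0.7 0.7
0.3 0.7  1.4 0.9  0.8 0.6  1.3 1  0.6 0.6
1.2 1  2 2.1  0.7 0.6  1.3 1.1  1.1 1
2 1.3  0.6 0.7  2.1 1.7  1.8 1.4  1.2 0.8
1 0.7  2.1 1.5  1.4 1  0.7 0.8  0.5 0.5
0.9 0.7  1.2 0.5  1.1 0.7  2.5 2  1 0.7
0.9 0.8  3 2.7  4.2 1.5  0.9 1  1.9 1.6
1 0.8  1.2 0.7  0.8 0.7  1.4 0.8  1.4 1.4
1 1
;
run;

/* The paper reports 51 observations; write the row count to the log */
proc sql noprint;
  select count(*) into :n_obs trimmed from health;
quit;
%put NOTE: HEALTH has &n_obs observations;
```

The resulting dataset can then be passed to the macros in place of the imported one, for example %scatter_corr (dataset=health, xvar=x, yvar=y).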

Ohio SAS Users Conference Presentation on Linear Regression

  • 1. Ohio SASĀ® Users Conference June 1, 2015, Cincinnati, Ohio, USA 1 How PROC SQL and SASĀ® Macro Programming Made My Statistical Analysis Easy? A Case Study on Linear Regression Venu Perla Ph.D., Independent SAS Programmer, Cross Lanes, WV 25313 Abstract Life scientists collect similar type of data on daily basis. Statistical analysis of this data is often performed using SAS programming techniques. Programming for each dataset is a time consuming job. The objective of this paper is to show how SAS programs are created for systematic analysis of raw data to develop a linear regression model for prediction. Then to show how PROC SQL can be used to replace several data steps in the code. Finally to show how SAS macros are created on these programs and used for routine analysis of similar data without any further hard coding in a short period of time. Introduction This paper exploited a raw data on two interrelated plant metabolites (X and Y) for generating a linear regression model. There are 51 observations in this replicated data. This analysis is carried out by SAS Ā® 9.4 software with windows operating system. Code is also tested with SAS Ā® Studio software 3.1. Code generated HTML-results are presented here with the STYLE option ā€˜HTMLBlueā€™. Importing Data from Excel Data used in this paper is imported from a sheet (XY_Data) of Microsoft Ā® Office Excel 97-2003 file (data1.xls) (see Appendix). PROC IMPORT is utilized to import ā€˜XY_Dataā€™ and renamed it as ā€˜HEALTHā€™ (Table 1). Macro variable, ā€˜PATHā€™ is created for Excel file path. While calling the FILE statement, a period (ā€˜.ā€™) is used at the end of this macro variable to avoid misinterpretation by the macro facility. File extension and DBMS statement in the code may be modified according to the Excel version used. 
%let path= C:UsersPerlaDesktop; title "Importing data from excel"; proc import file="&path.data1.xls" out=health replace dbms=xls; sheet=XY_Data; getnames=yes; run; title "Checking imported data"; proc print data=health; run;
  • 2. Ohio SASĀ® Users Conference June 1, 2015, Cincinnati, Ohio, USA 2 Macro ā€˜EXCEL_IMPORTā€™ is defined below for above code for importing Excel files. Where, ā€˜EXCEL_FILE=ā€™ is name of the excel file to be used; ā€˜EXCEL_SHEET=ā€™ is name of the excel sheet to be imported; and ā€˜DATASET=ā€™ is name of the output dataset. %macro excel_import (excel_file= , excel_sheet= , dataset= ); title "Importing data from excel"; proc import file="&path.&excel_file..xls" out=&dataset replace dbms=xls; sheet=&excel_sheet; getnames=yes; run; title "Dataset from imported excel data"; proc print data=&dataset; run; %mend excel_import; Defined macros are saved in a single folder (STATMACROS) for future use. C:UsersPerlaDocumentsMy SAS Files9.4statmacros For importing XY_Data, macro EXCEL_IMPORT can be called by following code after specifying location of the stored macros under global OPTIONS statement. MPRINT, MLOGIC and SYMBOLGEN are the global OPTIONS for debugging the code. options mprint mlogic symbolgen; options mautosource sasautos= "C:UsersPerlaDocumentsMy SAS Files9.4statmacros"; %excel_import (excel_file=data1, excel_sheet=XY_Data, dataset=health); Preliminary Analysis of Data Relationship between X- and Y-variables can be visualized using PROC SGPLOT and PROC CORR.
  • 3. Ohio SASĀ® Users Conference June 1, 2015, Cincinnati, Ohio, USA 3 ods graphics on; title "Scatter plot of X and Y"; proc sgplot data= health; scatter x=x y=y; run; title "Correlation between X and Y"; proc corr data = health; var x y; run; ods graphics off; Scatter plot of X and Y indicates that there is no clear relationship between these two variables (Figure 1). Results on Pearson correlation coefficients indicate a weak correlation between X- and Y-variables (Table 2). Macro ā€˜SCATTER_CORRā€™ is defined below for above code. Where, ā€˜DATASET=ā€™ is name of the dataset to be used for analysis; and ā€˜XVAR=ā€™ and ā€˜YVAR=ā€™ are the names of the X- and Y-variables, respectively. %macro scatter_corr (dataset= , xvar= , yvar= ); ods graphics on; title "Scatter plot of &xvar and &yvar"; proc sgplot data= &dataset; scatter x=&xvar y=&yvar; run;
  • 4. Ohio SASĀ® Users Conference June 1, 2015, Cincinnati, Ohio, USA 4 title "Correlation between &xvar and &yvar"; proc corr data = &dataset; var &xvar &yvar; run; ods graphics off; %mend scatter_corr; Macro ā€˜SCATTER_CORRā€™ can be invoked by following statement for the dataset ā€˜HEALTHā€™: %scatter_corr (dataset=health, xvar=x, yvar=y); There is an indication of a weak correlation between X and Y (Pearson correlation coefficient: 0.35). Further analysis is carried out on this raw data using PROC REG and PROC UNIVARIATE. LACKFIT option of MODEL statement in PROC REG determines whether this linear model is a good fit for this replicated data or not? Residual analysis and normality tests are carried out using PROC UNIVARIATE with NORMAL option. ODS graphics on; title "Regression analysis"; proc reg data = health plots(only)=diagnostics (unpack); model y = x/lackfit; output out =mdlres r=resid; run; ODS graphics off; proc univariate data= mdlres normal; var resid; run; Analysis of variance indicates that LACK OF FIT for the linear model is significant (Table 3). This suggests that further in-depth analysis has to be carried out on this raw data before rejecting the model. Parameter estimates and adjusted R 2 value for the raw data are provided in Table 4A and 4B, respectively. Adjusted R 2 value is negligible (0.11).
The distribution of residuals for Y is not normal for the raw data (Figure 2). Furthermore, the observed-by-predicted plot for Y shows that the observations are crowded in the lower left corner of the plot (Figure 3).
The Q-Q plot of residuals for Y further confirms that the raw data is not normal (Figure 4). The raw data is skewed (Table 5), and the significant p-values for all four tests of normality confirm the non-normal distribution (Table 6).
Macro ‘REG_NORMALITY’ is defined below for the regression analysis and normality tests described above, where ‘DATASET=’ is the name of the dataset to be analyzed, and ‘XVAR=’ and ‘YVAR=’ are the names of the X- and Y-variables, respectively.
%macro reg_normality (dataset= , xvar= , yvar= );
ods graphics on;
title "Regression analysis: Dataset &dataset";
proc reg data=&dataset plots(only)=diagnostics(unpack);
model &yvar = &xvar / lackfit;
output out=mdlres r=resid;
run;
quit;
proc univariate data=mdlres normal;
var resid;
run;
ods graphics off;
%mend reg_normality;
Macro ‘REG_NORMALITY’ can be called with the following statement for the dataset ‘HEALTH’:
%reg_normality (dataset=health, xvar=x, yvar=y);
Preliminary Data Transformation
A Box-Cox power transformation can be used to normalize the raw data. The data must contain only non-zero, non-negative values before the Box-Cox transformation is applied. The following code shifts the X- and Y-variables to positive values only when zero or negative values are encountered in the data. PROC SQL is used for this shift, and table HEALTH_COX is created from dataset HEALTH. Here PROC SQL reproduced the original data because it contains no zeros or negative values (Table 7; Log 1).
title "Transforming X and Y values into non-zero and non-negative values";
proc sql;
create table health_cox as
select case when min(x) <= 0 then (-(min(x)) + x + 1) else x end as X,
case when min(y) <= 0 then (-(min(y)) + y + 1) else y end as Y
from health;
quit;
proc print data=health_cox;
run;
Macro ‘TRANSFORM_ZERO_NEG’ is defined below for the above PROC SQL code, where ‘DATASET=’ is the name of the input dataset; ‘XVAR=’ and ‘YVAR=’ are the names of the X- and Y-variables to be transformed, respectively; and ‘PRE_TRANS_DATASET=’ is the name of the output dataset to be created with the transformed X- and Y-variables.
%macro transform_zero_neg (dataset= , xvar= , yvar= , pre_trans_dataset= );
title "Transforming &xvar and &yvar values into non-zero and non-negative values";
proc sql;
create table &pre_trans_dataset as
select case when min(&xvar) <= 0 then (-(min(&xvar)) + &xvar + 1) else &xvar end as &xvar,
case when min(&yvar) <= 0 then (-(min(&yvar)) + &yvar + 1) else &yvar end as &yvar
from &dataset;
quit;
proc print data=&pre_trans_dataset;
run;
%mend transform_zero_neg;
Macro ‘TRANSFORM_ZERO_NEG’ can be invoked with the following statement for the dataset ‘HEALTH’:
%transform_zero_neg (dataset=health, xvar=x, yvar=y, pre_trans_dataset=health_cox);
Box-Cox Power Transformation
The Box-Cox power transformation of the non-zero, non-negative data is performed using PROC TRANSREG with ODS GRAPHICS on.
title "Box-Cox power transformation: Identification of right exponent (Lambda)";
ods graphics on;
proc transreg data=health_cox;
model boxcox(y) = identity(x);
run;
ods graphics off;
The above code generated the Box-Cox analysis for Y (Figure 6). The selected lambda (-0.75 at 95% CI) is the exponent to be used to transform the data toward a normal shape. To obtain the convenient lambda value, the same code is executed without the ODS GRAPHICS statements.
proc transreg data=health_cox;
model boxcox(y) = identity(x);
run;
This code generated the best lambda, the lambda with its 95% confidence interval, and the convenient lambda (Table 8). The convenient lambda is used for transforming the Y-variable in this analysis.
Macro ‘BOX_COX_LAMBDA’ is defined below for the above code, where ‘PRE_TRANS_DATASET=’ is the name of the input dataset with non-zero and non-negative values, and ‘XVAR=’ and ‘YVAR=’ are the names of the X- and Y-variables, respectively.
%macro box_cox_lambda (pre_trans_dataset= , xvar= , yvar= );
title "Box-Cox power transformation: Identification of right exponent (Lambda)";
ods graphics on;
proc transreg data=&pre_trans_dataset;
model boxcox(&yvar) = identity(&xvar);
run;
ods graphics off;
proc transreg data=&pre_trans_dataset;
model boxcox(&yvar) = identity(&xvar);
run;
%mend box_cox_lambda;
Macro ‘BOX_COX_LAMBDA’ can be called with the following statement for the dataset ‘HEALTH_COX’:
%box_cox_lambda (pre_trans_dataset=health_cox, xvar=x, yvar=y);
A DATA step program is used to transform the Y-variable. The following DATA step covers the common convenient lambda values (-2, -1, -0.5, 0, 0.5, 1 and 2), the respective Y-transformations (1/Y², 1/Y, 1/sqrt(Y), log(Y), sqrt(Y), Y and Y²), and the respective transformed-Y variable names (neg_2_y, neg_1_y, neg_half_y, zero_y, half_y, one_y and two_y).
title "Transformation of Y-variable with convenient lambda";
data health_trans_1;
set health_cox;
neg_2_y = 1/(y**2);
neg_1_y = 1/(y**1);
neg_half_y = 1/(sqrt(y));
zero_y = log(y);
half_y = sqrt(y);
one_y = y**1;
two_y = y**2;
run;
proc print data=health_trans_1;
run;
This code generated dataset ‘HEALTH_TRANS_1’ with the new transformed Y-variables (Table 9A). Alternatively, the following PROC SQL code generates the same dataset under a different name.
title "Transformation of Y-values with convenient lambda";
proc sql;
create table health_trans as
select x, y,
1/(y**2) as neg_2_y,
1/(y**1) as neg_1_y,
1/(sqrt(y)) as neg_half_y,
log(y) as zero_y,
sqrt(y) as half_y,
y**1 as one_y,
y**2 as two_y
from health_cox;
quit;
proc print data=health_trans;
run;
PROC SQL generated the ‘HEALTH_TRANS’ table (Table 9B). ‘neg_1_y’ is the transformed Y-variable corresponding to the convenient lambda of -1, and it is used for further analysis.
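As background for the transformations listed above, the Box-Cox family in its usual textbook form is sketched below. This is not taken from the paper's output; it is included to show how the simple power forms used in the code relate to the family, and why the sign of the slope changes once λ = -1 is adopted.

```latex
\[
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda}-1}{\lambda}, & \lambda \neq 0,\\[6pt]
\ln y, & \lambda = 0,
\end{cases}
\qquad y > 0 .
\]
\[
\lambda = -1:\qquad y^{(-1)} = 1-\frac{1}{y}
\quad\Longrightarrow\quad
\frac{1}{y} = 1 - y^{(-1)} .
\]
```

The plain reciprocal 1/Y stored in ‘neg_1_y’ is therefore a linear function of the Box-Cox transform, so it has the same effect on normality. Because 1/Y decreases as Y increases, a positive association between X and Y appears as a negative slope and a negative correlation when ‘neg_1_y’ is the response.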
The datasets obtained with the DATA step program and the PROC SQL code are compared with PROC COMPARE. Both datasets are equal in all respects (Log 2).
title "Comparison of output obtained by DATA step and PROC SQL methods";
proc compare base=health_trans compare=health_trans_1 printall;
run;
Macro ‘TRANSFORM_LAMBDA’ is defined below for the above PROC SQL code, where ‘PRE_TRANS_DATASET=’ is the name of the input dataset with non-zero and non-negative X- and Y-values; ‘XVAR=’ and ‘YVAR=’ are the names of the X- and Y-variables, respectively; and ‘TRANS_DATASET=’ is the name of the output dataset with the transformed data.
%macro transform_lambda (pre_trans_dataset= , xvar= , yvar= , trans_dataset= );
title "Transformation of &yvar.-values with convenient lambda";
proc sql;
create table &trans_dataset as
select &xvar, &yvar,
1/(&yvar**2) as neg_2_&yvar,
1/(&yvar**1) as neg_1_&yvar,
1/(sqrt(&yvar)) as neg_half_&yvar,
log(&yvar) as zero_&yvar,
sqrt(&yvar) as half_&yvar,
&yvar**1 as one_&yvar,
&yvar**2 as two_&yvar
from &pre_trans_dataset;
quit;
proc print data=&trans_dataset;
run;
%mend transform_lambda;
Macro ‘TRANSFORM_LAMBDA’ can be invoked with the following code for the dataset ‘HEALTH_COX’:
%transform_lambda (pre_trans_dataset=health_cox, xvar=x, yvar=y, trans_dataset=health_trans);
Standardization of X-variable
After transformation of the Y-variable, the X-variable is standardized using PROC STDIZE in order to obtain a meaningful Y-intercept. With METHOD=MEAN, the mean of X is subtracted from each X-value, so the intercept becomes the predicted response at the mean of X. Dataset ‘HEALTH2’ is generated from table ‘HEALTH_TRANS’ in this procedure. The OPREFIX option prefixes the original X-variable name with ‘Unstdized_’, while the standardized X-values are stored under X.
title "Standardized X-variable after Y-transformation";
proc stdize data=health_trans oprefix=Unstdized_ method=mean out=health2;
var x;
run;
proc print data=health2;
run;
The generated dataset ‘HEALTH2’ contains the standardized X-variable in the last column as X (Table 10).
Macro ‘STDIZE_X’ is defined below for the above code, where ‘TRANS_DATASET=’ is the name of the input dataset; ‘TRANS_STDIZE_DATASET=’ is the name of the output dataset; and ‘XVAR=’ is the name of the X-variable to be standardized.
%macro stdize_x (trans_dataset= , trans_stdize_dataset= , xvar= );
title "Standardized &xvar.-variable after Y-transformation";
proc stdize data=&trans_dataset oprefix=Unstdized_ method=mean out=&trans_stdize_dataset;
var &xvar;
run;
proc print data=&trans_stdize_dataset;
run;
%mend stdize_x;
Macro ‘STDIZE_X’ can be called with the following code for the dataset ‘HEALTH_TRANS’:
%stdize_x (trans_dataset=health_trans, trans_stdize_dataset=health2, xvar=x);
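In keeping with the paper's theme of replacing steps with PROC SQL, the centering done by PROC STDIZE with METHOD=MEAN could also be sketched in PROC SQL. This is an illustration rather than the author's code: HEALTH2_SQL is a hypothetical table name, and only three of the columns of HEALTH_TRANS are carried over to keep the sketch short.

```sas
/* Sketch: mean-center X with PROC SQL instead of PROC STDIZE.          */
/* HEALTH2_SQL is a hypothetical output name, not used in the paper.    */
proc sql;
create table health2_sql as
select y,
       neg_1_y,
       x as Unstdized_x,   /* keep the original X, as OPREFIX= does      */
       x - mean(x) as x    /* MEAN() is remerged with every detail row   */
from health_trans;
quit;
```

PROC SQL writes a note to the log that the query requires remerging summary statistics with the original data; that note is expected here. Row order after a remerge is not guaranteed, so the result is best checked against ‘HEALTH2’ by value rather than by position.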
Regression Analysis of Transformed-Standardized Data
Regression analysis and normality tests are performed on the transformed and standardized dataset ‘HEALTH2’ by calling the previously defined macro ‘REG_NORMALITY’. Variable X is the standardized X, and ‘neg_1_y’ is the transformed Y.
%reg_normality (dataset=health2, xvar=x, yvar=neg_1_y);
With the transformed data, LACK OF FIT for the linear model is non-significant, which indicates that the linear model is acceptable for X and Y (Table 11). Parameter estimates for the intercept and X are significant (Table 12A). Compared to the raw data, the adjusted R² value improved from 0.11 to 0.53 (Table 12B). Other results indicate that the transformed data is normally distributed (Figures 6-8; Table 13). The non-significant p-value of the Kolmogorov-Smirnov normality test further supports this (Table 14). However, the other tests of normality are still significant, which indicates that there is room for further improvement with respect to normal distribution. At least one outlier and leverage observation is influencing the normal distribution (Figure 9).
Outlier and Influential Observations
The R, INFLUENCE, RSTUDENTBYLEVERAGE, DFFITS, DFBETAS and COOKSD options are used in PROC REG to generate detailed outlier and influence diagnostics for the dataset ‘HEALTH2’ (Table 15; Figures 10-13).
ods graphics on;
title "Outlier or Influential observations";
proc reg data=health2 plots(only label)=(rstudentbyleverage dffits dfbetas cooksd);
model neg_1_y = x / r influence;
run;
quit;
ods graphics off;
The highest number of asterisks is seen for observation number 43 (Table 15). This observation turned out to be an outlier and leverage observation (Figure 11). Other results also support that observation number 43 is an outlier and influential observation in the dataset ‘HEALTH2’ (Figures 10-13).
Macro ‘OUTLIER_OBS’ is defined below for the above code, where ‘INDATA=’ is the name of the input dataset, and ‘XVAR=’ and ‘YVAR=’ are the names of the X- and Y-variables to be used in the analysis, respectively.
%macro outlier_obs (indata= , xvar= , yvar= );
ods graphics on;
title "Outlier or influential observations: Dataset &indata";
proc reg data=&indata plots(only label)=(rstudentbyleverage dffits dfbetas cooksd);
model &yvar = &xvar / r influence;
run;
quit;
ods graphics off;
%mend outlier_obs;
Macro ‘OUTLIER_OBS’ can be invoked with the following statement for the dataset ‘HEALTH2’:
%outlier_obs (indata=health2, xvar=x, yvar=neg_1_y);
Slicing One Outlier Observation
Dataset ‘SLICED’, holding outlier and leverage observation number 43, is created from the dataset ‘HEALTH2’ using the following DATA step code. SAS-supplied observation numbers are used to identify the outlier(s) and write them to a dataset. Alternatively, a WHERE statement can be used in PROC REG to omit observations while performing the regression analysis.
title "Dataset for outlier observation(s): sliced";
data sliced;
do slice=43;
set health2 point=slice;
output;
end;
stop;
run;
proc print data=sliced;
run;
Dataset ‘SLICED’ with one outlier observation is shown below (Table 16).
Macro ‘SLICE_OBS’ is defined below for the above code, where ‘INDATA=’ is the name of the input dataset; ‘OB=’ is the outlier observation number; and ‘SLICED_DATA=’ is the name of the output dataset that stores only the outlier observation.
%macro slice_obs (indata= , ob=0 , sliced_data= );
title "Dataset for outlier observation(s): &sliced_data";
data &sliced_data;
do slice=&ob;
set &indata point=slice;
output;
end;
stop;
run;
proc print data=&sliced_data;
run;
%mend slice_obs;
Macro ‘SLICE_OBS’ can be called with the following statement for outlier observation number 43 of the dataset ‘HEALTH2’:
%slice_obs (indata=health2, ob=43, sliced_data=sliced);
When no observation number is provided, use OB=0 to produce missing values for the outlier observation. Further analysis is not affected by these missing values.
%slice_obs (indata=health2, ob=0, sliced_data=sliced);
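The POINT= technique above selects a row by its position. The complementary operation, keeping every row except that one, can be sketched in a single DATA step with the automatic variable _N_. This is an illustrative alternative, not the method used in the paper; HEALTH3_ALT and the macro variable OB are hypothetical names.

```sas
/* Sketch: drop one observation by position in a single DATA step.     */
/* HEALTH3_ALT and &ob are hypothetical names, not from the paper.     */
%let ob = 43;
data health3_alt;
set health2;
if _n_ ne &ob;   /* subsetting IF keeps every row except row number &ob */
run;
proc print data=health3_alt;
run;
```

Observation numbers refer to the current row order of the dataset. A PROC SORT without an OUT= option reorders the dataset in place, so the number reported by PROC REG should be read from the same, unsorted copy of the data that is passed to this step.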
Dataset without One Outlier Observation
The following DATA step program generates dataset ‘HEALTH3_1’ with all the observations of the dataset ‘HEALTH2’ except the one that matches the outlier observation in the dataset ‘SLICED’ (Table 17A). Here, the data has to be sorted before merging. Note that the output dataset ‘HEALTH3_1’ contains only 50 observations. The total real and CPU times required for the PROC SORT and DATA steps are 0.07 and 0.06 seconds, respectively (Log 3A).
title "Sorting datasets before merging";
proc sort data=health2;
by unstdized_x y;
run;
title "Dataset without outlier observation(s)";
data health3_1;
merge health2 (in=inhealth) sliced (in=insliced);
by unstdized_x y;
if inhealth ^= insliced;
run;
proc print data=health3_1;
run;
Alternatively, the following PROC SQL code produces the same output in the form of table ‘HEALTH3’ (Table 17B). Unlike the DATA step program, PROC SQL can combine the datasets without sorting. The real and CPU times required for PROC SQL are 0.01 and 0.03 seconds, respectively (Log 3B). In other words, the PROC SQL code is shorter and quicker than the DATA step program in this example.
title "Dataset without outlier observation(s)";
proc sql;
create table health3 as
select * from health2
except all
select * from sliced;
quit;
proc print data=health3;
run;
The outputs of the DATA step program and PROC SQL are compared and verified with PROC COMPARE. All the values in the datasets ‘HEALTH3_1’ and ‘HEALTH3’ are exactly equal (Log 3C).
title "Comparison of datasets: Data step program vs PROC SQL";
proc compare base=health3 compare=health3_1 printall;
run;
Macro ‘NO_OUTLIER_DATA’ is defined below for the above PROC SQL code, where ‘INDATA=’ is the name of the dataset with all the observations; ‘SLICED_DATA=’ is the name of the dataset with only the outlier observations; and ‘OUTDATA=’ is the name of the output dataset with all the observations except the outliers.
%macro no_outlier_data (indata= , sliced_data= , outdata= );
title "&outdata.: Dataset without outlier observation(s)";
proc sql;
create table &outdata as
select * from &indata
except all
select * from &sliced_data;
quit;
proc print data=&outdata;
run;
%mend no_outlier_data;
Macro ‘NO_OUTLIER_DATA’ can be invoked with the following statement for the datasets ‘HEALTH2’ and ‘SLICED’:
%no_outlier_data (indata=health2, sliced_data=sliced, outdata=health3);
Regression Analysis without One Outlier Observation
Regression analysis and normality tests are performed on dataset ‘HEALTH3’ by invoking the previously defined macro ‘REG_NORMALITY’. Dataset ‘HEALTH3’ is devoid of one outlier observation. X is the standardized X-variable, and ‘neg_1_y’ is the transformed Y-variable.
%reg_normality (dataset=health3, xvar=x, yvar=neg_1_y);
Since LACK OF FIT is non-significant, the linear model for X and Y can be accepted for prediction (Table 18). In the absence of one outlier observation, the adjusted R² value increased from 0.53 in the previous regression analysis to 0.61 (Table 19B). The parameter estimates also changed (Table 19A). The results suggest that the data is normal (Figures 14 and 15; Table 20). Further improvement may be possible, as the Shapiro-Wilk test, one of the four tests of normality, is still significant. Care should be taken not to eliminate too many observations while improving the normal distribution of the data.
Regression Analysis without Second Outlier Observation
The previously defined macros come in handy for this task, and no further hard coding is required. The following macros are invoked to complete it.
%outlier_obs (indata=health3, xvar=x, yvar=neg_1_y);
%slice_obs (indata=health3, ob=10, sliced_data=sliced2);
%no_outlier_data (indata=health3, sliced_data=sliced2, outdata=health4);
%reg_normality (dataset=health4, xvar=x, yvar=neg_1_y);
Outlier and influential observations are produced by calling macro ‘OUTLIER_OBS’. Observation number 10, which has the most asterisks in dataset ‘HEALTH3’, is identified as an outlier (Table 21; Figure 16).
Macro ‘SLICE_OBS’ stored one outlier observation (obs # 10) in a dataset named ‘SLICED2’ (Table 22). Table ‘HEALTH4’, without the second outlier observation, is generated by calling macro ‘NO_OUTLIER_DATA’ (Table 23). Note that the total number of observations in the dataset ‘HEALTH4’ is reduced to 49 (Table 23; Log 4).
Regression analysis and normality tests are run on dataset ‘HEALTH4’ by macro ‘REG_NORMALITY’. As in the earlier results, further improvement was achieved in the parameter estimates and adjusted R² (0.7). The data is closer to normal than in the previous analysis (Tables 24, 25A and 25B; Figures 17 and 18). However, the Shapiro-Wilk normality test is still significant (Table 26).
Regression Analysis without Third Outlier Observation
For the sake of exploration, further analysis is carried out to eliminate a third outlier observation from the data. Interestingly, invoking macro ‘OUTLIER_OBS’ on the ‘HEALTH4’ dataset produced two competing outlier candidates (obs # 47 and obs # 28) (Table 27; Figures 19-22). For this reason, further analysis with the other macros (SLICE_OBS, NO_OUTLIER_DATA and REG_NORMALITY) is carried out separately for observation numbers 47 and 28.
%outlier_obs (indata=health4, xvar=x, yvar=neg_1_y);
Analysis without outlier observation number 47:
%slice_obs (indata=health4, ob=47, sliced_data=sliced3);
%no_outlier_data (indata=health4, sliced_data=sliced3, outdata=health5);
%reg_normality (dataset=health5, xvar=x, yvar=neg_1_y);
Elimination of observation number 47 did not improve the outcome of the normality tests (Table 28). The Shapiro-Wilk test is still significant.
Analysis without outlier observation number 28:
%slice_obs (indata=health4, ob=28, sliced_data=sliced3);
%no_outlier_data (indata=health4, sliced_data=sliced3, outdata=health5);
%reg_normality (dataset=health5, xvar=x, yvar=neg_1_y);
Dataset ‘SLICED3’ for outlier observation number 28 is generated by invoking macro ‘SLICE_OBS’ (Log 5). Note that dataset ‘HEALTH5’ contains only 48 observations (Log 6).
After eliminating observation number 28, all four tests of normality are non-significant (Table 29). Other results support this (Figures 23 and 24). The residual-fit spread plot indicates how much of the variation in the model is accounted for by the X-variable (Figure 25). Analysis of variance, parameter estimates and adjusted R² are presented in Tables 30, 31A and 31B, respectively. As in the previous analysis, LACK OF FIT for the model is non-significant (Table 30). The adjusted R² value is further improved to 0.73.
Linear Regression Models
From the above analysis, the linear regression models for the raw data, the normalized data, and the normalized data without 1 to 3 outlier observations are given below. After normalization, the data started to exhibit the true relationship between X and Y. One should consider several other factors before proceeding to eliminate outlier observations. In the normalized models, Y denotes the transformed response ‘neg_1_y’ (1/Y) and X the standardized X, as in the macro calls above.
Raw data (non-normal): Y = 0.124 + 0.933X (Adjusted R²: 0.11)
Normalized data: Y = 1.240 - 0.514X (Adjusted R²: 0.53)
Normalized data without 1 outlier observation: Y = 1.213 - 0.653X (Adjusted R²: 0.61)
Normalized data without 2 outlier observations: Y = 1.234 - 0.689X (Adjusted R²: 0.70)
Normalized data without 3 outlier observations: Y = 1.218 - 0.690X (Adjusted R²: 0.73)
Scatter Plots after Normalization
Optionally, the relationship between X and Y can be visualized by calling macro ‘SCATTER_CORR’ again for the transformed dataset ‘HEALTH2’ and the final dataset ‘HEALTH5’.
Analysis of dataset ‘HEALTH2’:
%scatter_corr (dataset=health2, xvar=x, yvar=neg_1_y);
Analysis of dataset ‘HEALTH5’:
%scatter_corr (dataset=health5, xvar=x, yvar=neg_1_y);
The Pearson correlation coefficient of the non-normal raw data is 0.35 (Table 2). Unlike the raw data, both datasets ‘HEALTH2’ and ‘HEALTH5’ are normal and exhibit a similar, strong negative relationship between X and Y, with Pearson correlation coefficients between -0.70 and -0.90 (Figures 26 and 27; Tables 32 and 33).
Master Macros
Three master macros are created for the whole analysis. When invoked, these master macros call the macros defined previously.
Master macro 1: IMP_SCATT_CORR_REG_NORMAL
This macro covers the initial set of operations (data import from the Excel file, scatter plot, correlation, regression analysis and normality tests). It contains three macros (EXCEL_IMPORT, SCATTER_CORR and REG_NORMALITY). All the keyword parameters are described above for each macro. The code for macro EXCEL_IMPORT may need to be modified according to the version of the Excel file.
%macro imp_scatt_corr_reg_normal (excel_file= , excel_sheet= , dataset= , xvar= , yvar= );
%excel_import (excel_file=&excel_file, excel_sheet=&excel_sheet, dataset=&dataset);
%scatter_corr (dataset=&dataset, xvar=&xvar, yvar=&yvar);
%reg_normality (dataset=&dataset, xvar=&xvar, yvar=&yvar);
%mend imp_scatt_corr_reg_normal;
Master macro 2: TRANSFORMATION_BOX_COX
This macro transforms the data if it is not normal (conditions apply). It contains four macros (TRANSFORM_ZERO_NEG, BOX_COX_LAMBDA, TRANSFORM_LAMBDA and STDIZE_X). All the keyword parameters are described above for each macro.
%macro transformation_box_cox (dataset= , pre_trans_dataset= , trans_dataset= , trans_stdize_dataset= , xvar= , yvar= );
%transform_zero_neg (dataset=&dataset, xvar=&xvar, yvar=&yvar, pre_trans_dataset=&pre_trans_dataset);
%box_cox_lambda (pre_trans_dataset=&pre_trans_dataset, xvar=&xvar, yvar=&yvar);
%transform_lambda (pre_trans_dataset=&pre_trans_dataset, xvar=&xvar, yvar=&yvar, trans_dataset=&trans_dataset);
%stdize_x (trans_dataset=&trans_dataset, trans_stdize_dataset=&trans_stdize_dataset, xvar=&xvar);
%mend transformation_box_cox;
Master macro 3: REGRESSION_WOUT_OUTLIERS
This macro identifies and eliminates outlier observations in the data, and uses the outlier-free data for regression analysis. It contains four macros (REG_NORMALITY, OUTLIER_OBS, SLICE_OBS and NO_OUTLIER_DATA). All the keyword parameters are described above for each macro.
%macro regression_wout_outliers (dataset= , indata= , ob= , sliced_data= , outdata= , xvar= , yvar= );
%reg_normality (dataset=&dataset, xvar=&xvar, yvar=&yvar);
%outlier_obs (indata=&indata, xvar=&xvar, yvar=&yvar);
%slice_obs (indata=&indata, ob=&ob, sliced_data=&sliced_data);
%no_outlier_data (indata=&indata, sliced_data=&sliced_data, outdata=&outdata);
%mend regression_wout_outliers;
Now the whole analysis can be performed on the same (or similar) type of data, without any further hard coding and in a short period of time, by calling the master macros in the following manner. The location of the stored macros must be specified before calling them.
options mautosource sasautos="C:UsersPerlaDocumentsMy SAS Files9.4statmacros";
%imp_scatt_corr_reg_normal (excel_file=data1, excel_sheet=XY_Data, dataset=health, xvar=x, yvar=y);
%transformation_box_cox (dataset=health, pre_trans_dataset=health_cox, trans_dataset=health_trans, trans_stdize_dataset=health2, xvar=x, yvar=y);
For the 3rd master macro (REGRESSION_WOUT_OUTLIERS), start with a DATASET that has the transformed Y- and standardized X-variables. Run this macro first with OB=0, then with OB= set to the observation number(s) to be deleted. In the first run, identify the outlier observation number. In the second run, slice this observation from the data. Repeat these two steps, with caution, until the desired results are achieved.
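The two-pass usage described above (first OB=0 to inspect, then OB=n to slice) could be made explicit with a macro conditional, so that the slicing and subtraction steps are skipped during the inspection pass. The sketch below is a hypothetical variant, not the author's macro: the name REGRESSION_WOUT_OUTLIERS2 is invented here, the separate DATASET= parameter is folded into INDATA=, and the four macros defined earlier are assumed to be available.

```sas
/* Sketch of a variant master macro; the name is hypothetical.          */
/* Assumes REG_NORMALITY, OUTLIER_OBS, SLICE_OBS and NO_OUTLIER_DATA    */
/* have already been defined as shown earlier in the paper.             */
%macro regression_wout_outliers2 (indata= , ob=0 , sliced_data= , outdata= , xvar= , yvar= );
%reg_normality (dataset=&indata, xvar=&xvar, yvar=&yvar);
%outlier_obs (indata=&indata, xvar=&xvar, yvar=&yvar);
/* Slice and subtract only when an observation number is supplied */
%if &ob ne 0 %then %do;
%slice_obs (indata=&indata, ob=&ob, sliced_data=&sliced_data);
%no_outlier_data (indata=&indata, sliced_data=&sliced_data, outdata=&outdata);
%end;
%mend regression_wout_outliers2;

/* First pass: inspect only. Second pass: remove observation 43. */
%regression_wout_outliers2 (indata=health2, xvar=x, yvar=neg_1_y);
%regression_wout_outliers2 (indata=health2, ob=43, sliced_data=sliced, outdata=health3, xvar=x, yvar=neg_1_y);
```

With this variant the inspection pass creates no output dataset, which avoids relying on the missing-value row that OB=0 produces in the original design.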
%regression_wout_outliers (dataset=health2, indata=health2, ob=0, sliced_data=sliced, outdata=health3, xvar=x, yvar=neg_1_y);
**From this run, it is clear that ob=43 is an outlier;
**For improvement, slice ob=43 from health2;
%regression_wout_outliers (dataset=health2, indata=health2, ob=43, sliced_data=sliced, outdata=health3, xvar=x, yvar=neg_1_y);
%regression_wout_outliers (dataset=health3, indata=health3, ob=0, sliced_data=sliced1, outdata=health4, xvar=x, yvar=neg_1_y);
**From this run, it is clear that ob=10 is an outlier;
**For improvement, slice ob=10 from health3;
%regression_wout_outliers (dataset=health3, indata=health3, ob=10, sliced_data=sliced1, outdata=health4, xvar=x, yvar=neg_1_y);
%regression_wout_outliers (dataset=health4, indata=health4, ob=0, sliced_data=sliced2, outdata=health5, xvar=x, yvar=neg_1_y);
**From this run, it is clear that ob=28 is an outlier;
**For improvement, slice ob=28 from health4;
%regression_wout_outliers (dataset=health4, indata=health4, ob=28, sliced_data=sliced2, outdata=health5, xvar=x, yvar=neg_1_y);
**Run again after satisfaction and use model parameters for final use;
%regression_wout_outliers (dataset=health5, indata=health5, ob=0, sliced_data=sliced3, outdata=health6, xvar=x, yvar=neg_1_y);
Optionally, after the above analysis, the relationship between X and Y can be visualized by calling macro ‘SCATTER_CORR’ again for the transformed datasets (HEALTH2 and HEALTH5).
%scatter_corr (dataset=health2, xvar=x, yvar=neg_1_y);
%scatter_corr (dataset=health5, xvar=x, yvar=neg_1_y);
Conclusion
In this paper, a simple linear regression model is developed for the X- and Y-variables after normalizing replicated raw data in a systematic manner. Various statistical methods, SAS DATA step programs and SAS SQL procedures are employed to achieve this goal. PROC SQL is used effectively in place of several DATA step programs. By bringing the SAS macro language on board, the number of SAS statements required to perform each repeatable task is reduced to a bare minimum. Furthermore, the defined macros can be used to analyze similar data without much hard coding and within a short period of time.
References
Box, G. E. P. and Cox, D. R. 1964. An analysis of transformations (with discussion). Journal of the Royal Statistical Society, Series B 26 (2): 211–252.
Buthmann, A. Making Data Normal Using Box-Cox Power Transformation. iSix Sigma. Available at http://www.isixsigma.com/tools-templates/normality/making-data-normal-using-box-cox-power-transformation/
Carpenter, Art. 2004. Carpenter’s Complete Guide to the SAS® Macro Language, Second Edition. SAS Institute Inc., Cary, NC, USA.
Lafler, Kirk Paul. 2013. PROC SQL: Beyond the Basics Using SAS®, Second Edition. SAS Institute Inc., Cary, NC, USA.
SAS® 9.4 Product Documentation. SAS Institute Inc., Cary, NC, USA. Available at http://support.sas.com/documentation/94/index.html
SAS/STAT® 9.3 User's Guide. SAS Institute Inc., Cary, NC, USA. Available at http://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#intro_toc.htm
SAS® 9.2 Macro Language: Reference. SAS Institute Inc., Cary, NC, USA. Available at http://support.sas.com/documentation/cdl/en/mcrolref/61885/HTML/default/viewer.htm#titlepage.htm
SAS® 9.3 SQL Procedure User’s Guide. SAS Institute Inc., Cary, NC, USA. Available at http://support.sas.com/documentation/cdl/en/sqlproc/63043/HTML/default/viewer.htm#titlepage.htm
Acknowledgments
I would like to thank the organizers for giving me an opportunity to present this paper at the Ohio SAS® Users Conference, hosted by CinSUG, CoSUG and CleveSUG on June 1, 2015 at the Kingsgate Marriott Conference Center at the University of Cincinnati, Cincinnati, Ohio.
Trademark Citations
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Author Biography
Venu Perla, Ph.D. is a biomedical researcher with about 14 years of research and teaching experience in an academic environment. He is currently working in West Virginia. He has served at Purdue University, Oregon Health & Science University, Colorado State University, Kerala Agricultural University (India) and Mangalayatan University (India) in different capacities. Dr. Perla has published 13 peer-reviewed research papers and 2 book chapters, obtained 1 international patent (on an orthopaedic implant device), and given 7 talks and presented 18 posters at national and international scientific conferences in his professional career. He was trained in clinical trials and clinical data management. He was also trained in advanced SAS® programming and clinical biostatistics at the University of California, San Diego. Currently, he is actively employing SAS® programming techniques in his research data analysis.
Contact Information
Phone (Cell): (304) 545-5705
Email: venuperla@yahoo.com
LinkedIn: https://www.linkedin.com/pub/venu-perla/2a/700/468
Appendix
Table 1. XY_Data sheet of data1.xls (Microsoft Excel 97-2003 file).
X     Y
0.4   0.4
0.6   0.5
2.2   15.3
0.4   0.7
0.1   0.5
0.7   0.6
2.5   1.1
0.4   0.5
0.5   0.6
1.3   0.9
0.4   0.4
1.8   1.6
0.5   1.8
0.5   0.5
0.7   0.7
0.3   0.7
1.4   0.9
0.8   0.6
1.3   1
0.6   0.6
1.2   1
2     2.1
0.7   0.6
1.3   1.1
1.1   1
2     1.3
0.6   0.7
2.1   1.7
1.8   1.4
1.2   0.8
1     0.7
2.1   1.5
1.4   1
0.7   0.8
0.5   0.5
0.9   0.7
1.2   0.5
1.1   0.7
2.5   2
1     0.7
0.9   0.8
3     2.7
4.2   1.5
0.9   1
1.9   1.6
1     0.8
1.2   0.7
0.8   0.7
1.4   0.8
1.4   1.4
1     1