Ohio SAS® Users Conference
June 1, 2015, Cincinnati, Ohio, USA
How PROC SQL and SAS® Macro Programming Made My Statistical Analysis Easy? A Case Study on Linear Regression
Venu Perla Ph.D., Independent SAS Programmer, Cross Lanes, WV 25313
Abstract
Life scientists collect similar types of data on a daily basis. Statistical analysis of these data is often performed using SAS programming techniques. Programming for each dataset is a time-consuming job. The objective of this paper is to show how SAS programs are created for systematic analysis of raw data to develop a linear regression model for prediction, then to show how PROC SQL can be used to replace several DATA steps in the code, and finally to show how SAS macros are created from these programs and used for routine analysis of similar data, quickly and without any further hard coding.
Introduction
This paper uses raw data on two interrelated plant metabolites (X and Y) to generate a linear regression model. There are 51 observations in this replicated dataset. The analysis is carried out with SAS® 9.4 software on the Windows operating system. The code is also tested with SAS® Studio 3.1. The HTML results generated by the code are presented here with the STYLE option “HTMLBlue”.
Importing Data from Excel
The data used in this paper are imported from a sheet (XY_Data) of a Microsoft® Office Excel 97-2003 file (data1.xls) (see Appendix). PROC IMPORT is used to import “XY_Data” and store it as the dataset “HEALTH” (Table 1). The macro variable “PATH” is created for the Excel file path. Where this macro variable is referenced in the FILE= option, a period (“.”) is placed at the end of its name so that the macro facility does not read the text that follows as part of the name. The file extension and the DBMS= option in the code may be modified according to the Excel version used.
%let path= C:\Users\Perla\Desktop\;
title "Importing data from excel";
proc import file="&path.data1.xls"
out=health replace
dbms=xls;
sheet=XY_Data;
getnames=yes;
run;
title "Checking imported data";
proc print data=health;
run;
The macro “EXCEL_IMPORT” is defined below from the above code for importing Excel files, where “EXCEL_FILE=” is the name of the Excel file to be used, “EXCEL_SHEET=” is the name of the Excel sheet to be imported, and “DATASET=” is the name of the output dataset.
%macro excel_import (excel_file= , excel_sheet= , dataset= );
title "Importing data from excel";
proc import file="&path.&excel_file..xls"
out=&dataset replace
dbms=xls;
sheet=&excel_sheet;
getnames=yes;
run;
title "Dataset from imported excel data";
proc print data=&dataset;
run;
%mend excel_import;
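The role of the periods in "&path.&excel_file..xls" can be checked in the log with a few %PUT statements. This is a minimal sketch; the folder C:\Temp\ is a hypothetical value used only for illustration.

```sas
/* Hypothetical folder, used only for illustration */
%let path = C:\Temp\;
%let excel_file = data1;

/* One period ends the macro variable name and is consumed */
%put &path.data1.xls;          /* writes C:\Temp\data1.xls to the log */

/* Two periods: the first ends the name, the second is the literal dot */
%put &path.&excel_file..xls;   /* writes C:\Temp\data1.xls to the log */

/* With a single period the dot is consumed and the extension is glued on */
%put &path.&excel_file.xls;    /* writes C:\Temp\data1xls to the log */
```

The last line shows why the macro needs the doubled period before the file extension.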
Defined macros are saved in a single folder (STATMACROS) for future use.
C:\Users\Perla\Documents\My SAS Files\9.4\statmacros
To import XY_Data, the macro EXCEL_IMPORT can be called with the following code after the location of the stored macros is specified in a global OPTIONS statement. MPRINT, MLOGIC and SYMBOLGEN are global options for debugging the code.
options mprint mlogic symbolgen;
options mautosource sasautos=
"C:\Users\Perla\Documents\My SAS Files\9.4\statmacros";
%excel_import (excel_file=data1, excel_sheet=XY_Data, dataset=health);
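The autocall facility finds a macro only if it is stored in a file whose name matches the macro name (for example, excel_import.sas for %EXCEL_IMPORT). Pointing SASAUTOS= at a single folder also replaces the default autocall library supplied with SAS. The following sketch, which assumes a hypothetical folder and the hypothetical fileref MYMACS, keeps both search locations available:

```sas
/* Hypothetical folder that holds excel_import.sas, scatter_corr.sas, ... */
filename mymacs "C:\Temp\statmacros";

/* Search the user folder first, then the SAS-supplied autocall library */
options mautosource sasautos=(mymacs sasautos);

/* Write the current SASAUTOS setting to the log to confirm it */
%put %sysfunc(getoption(sasautos));
```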
Preliminary Analysis of Data
The relationship between the X- and Y-variables can be visualized using PROC SGPLOT and quantified using PROC CORR.
ods graphics on;
title "Scatter plot of X and Y";
proc sgplot data= health;
scatter x=x y=y;
run;
title "Correlation between X and Y";
proc corr data = health;
var x y;
run;
ods graphics off;
The scatter plot of X and Y indicates that there is no clear relationship between these two variables (Figure 1). The Pearson correlation coefficient indicates a weak correlation between the X- and Y-variables (Table 2).
The macro “SCATTER_CORR” is defined below for the above code, where “DATASET=” is the name of the dataset to be used for analysis, and “XVAR=” and “YVAR=” are the names of the X- and Y-variables, respectively.
%macro scatter_corr (dataset= , xvar= , yvar= );
ods graphics on;
title "Scatter plot of &xvar and &yvar";
proc sgplot data= &dataset;
scatter x=&xvar y=&yvar;
run;
title "Correlation between &xvar and &yvar";
proc corr data = &dataset;
var &xvar &yvar;
run;
ods graphics off;
%mend scatter_corr;
The macro “SCATTER_CORR” can be invoked with the following statement for the dataset “HEALTH”:
%scatter_corr (dataset=health, xvar=x, yvar=y);
There is an indication of a weak correlation between X and Y (Pearson correlation coefficient: 0.35). Further analysis is carried out on the raw data using PROC REG and PROC UNIVARIATE. The LACKFIT option of the MODEL statement in PROC REG tests whether the linear model is a good fit for this replicated data. Residual analysis and normality tests are carried out using PROC UNIVARIATE with the NORMAL option.
ODS graphics on;
title "Regression analysis";
proc reg data = health plots(only)=diagnostics (unpack);
model y = x/lackfit;
output out =mdlres r=resid;
run;
ODS graphics off;
proc univariate data= mdlres normal;
var resid;
run;
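The lack-of-fit test requested by LACKFIT relies on the replicated X-values. As a sketch in standard textbook notation (the symbols c, n_i and the sums of squares are not named in the paper), with n observations at c distinct X-values, the error sum of squares of the straight-line model is split into pure error and lack of fit:

```latex
\begin{aligned}
SS_{PE}  &= \sum_{i=1}^{c}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_{i}\right)^2,
          & df_{PE}  &= n - c, \\
SS_{LOF} &= SS_{Error} - SS_{PE},
          & df_{LOF} &= c - 2, \\
F        &= \frac{SS_{LOF} / (c - 2)}{SS_{PE} / (n - c)} .
\end{aligned}
```

A significant F suggests that the straight line leaves systematic structure unexplained beyond the replicate-to-replicate variation.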
The analysis of variance indicates that the lack of fit for the linear model is significant (Table 3). This suggests that further in-depth analysis has to be carried out on the raw data before rejecting the model.
Parameter estimates and the adjusted R² value for the raw data are provided in Tables 4A and 4B, respectively. The adjusted R² value is low (0.11).
The distribution of residuals for Y is not normal for the raw data (Figure 2).
Furthermore, the observed-by-predicted plot for Y indicates that all the observations are crowded in the lower left corner of the plot (Figure 3).
The Q-Q plot of residuals for Y further confirms that the raw data are not normal (Figure 4).
The raw data are skewed (Table 5), and the significant p-values for all four tests of normality indicate a non-normal distribution (Table 6).
The macro “REG_NORMALITY” is defined below for the regression analysis and normality tests described above, where “DATASET=” is the name of the dataset to be used for analysis, and “XVAR=” and “YVAR=” are the names of the X- and Y-variables, respectively.
%macro reg_normality (dataset= ,xvar= ,yvar= );
ODS graphics on;
title "Regression analysis: Dataset &dataset";
proc reg data = &dataset plots(only)=diagnostics (unpack);
model &yvar = &xvar/lackfit;
output out =mdlres r=resid;
run;
proc univariate data= mdlres normal;
var resid;
run;
ODS graphics off;
%mend reg_normality;
The macro “REG_NORMALITY” can be called with the following statement for the dataset “HEALTH”:
%reg_normality (dataset=health, xvar=x, yvar=y);
Preliminary Data Transformation
A Box-Cox power transformation can be adopted to normalize the raw data. The data must contain only positive values before the Box-Cox power transformation is applied. The following code shifts the X- and Y-variables to positive values only when zero or negative values are encountered in the data.
PROC SQL is used for this shift. The table HEALTH_COX is created from the dataset HEALTH in this procedure. Here PROC SQL reproduces the original data because there are no zero or negative values (Table 7; Log 1).
title "Transforming X and Y values into non-zero and non-negative values";
proc sql;
create table health_cox as
select case
when min(x) <=0 then (-(min(x))+x+1)
else x
end as X,
case
when min(y) <=0 then (-(min(y))+y+1)
else y
end as Y
from health;
quit;
proc print data=health_cox;
run;
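Since the HEALTH data contain no zero or negative values, the shift is never triggered in this analysis. The following sketch applies the same CASE logic to a small made-up dataset (TOY and TOY_SHIFTED are hypothetical names) so the effect can be seen:

```sas
/* Hypothetical toy data with a negative and a zero value in x */
data toy;
   input x y;
   datalines;
-2 5
0 7
3 9
;
run;

proc sql;
   create table toy_shifted as
   select case when min(x) <= 0 then x - min(x) + 1 else x end as x,
          case when min(y) <= 0 then y - min(y) + 1 else y end as y
   from toy;
quit;

/* min(x) = -2, so every x becomes x + 3 (1, 3, 6);        */
/* min(y) = 5 is positive, so y is left unchanged (5, 7, 9) */
proc print data=toy_shifted;
run;
```

PROC SQL writes a note about remerging summary statistics for this kind of query, because MIN() is combined with row-level values.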
The macro “TRANSFORM_ZERO_NEG” is defined below for the above PROC SQL code, where “DATASET=” is the name of the input dataset to be used for transforming the X- and Y-values; “XVAR=” and “YVAR=” are the names of the X- and Y-variables to be transformed, respectively; and “PRE_TRANS_DATASET=” is the name of the output dataset to be created with the transformed X- and Y-variables.
%macro transform_zero_neg (dataset= ,xvar= ,yvar= ,pre_trans_dataset=);
title "Transforming &xvar and &yvar values into non-zero and non-negative values";
proc sql;
create table &pre_trans_dataset as
select case
when min(&xvar) <=0 then (-(min(&xvar))+&xvar+1)
else &xvar
end as &xvar,
case
when min(&yvar) <=0 then (-(min(&yvar))+&yvar+1)
else &yvar
end as &yvar
from &dataset;
quit;
proc print data=&pre_trans_dataset;
run;
%mend transform_zero_neg;
The macro “TRANSFORM_ZERO_NEG” can be invoked with the following statement for the dataset “HEALTH”:
%transform_zero_neg
(dataset=health,xvar=x,yvar=y,pre_trans_dataset=health_cox);
Box-Cox Power Transformation
The Box-Cox power transformation of the positive-valued data is performed using PROC TRANSREG with ODS GRAPHICS turned on.
title "Box-Cox power transformation: Identification of right exponent (Lambda)";
ods graphics on;
proc transreg data= health_cox;
model boxcox(y) = identity(x);
run;
ods graphics off;
The above code generates the Box-Cox analysis for Y (Figure 6). The selected lambda (-0.75 at 95% CI) is the exponent to be used to transform the data towards a normal shape.
To obtain the convenient lambda value, the above SAS code is executed without the ODS GRAPHICS statements.
proc transreg data = health_cox;
model boxcox(y)=identity(x);
run;
This code reports the best lambda, the lambda with its 95% confidence interval, and the convenient lambda (Table 8). The convenient lambda is used for transforming the Y-variable in this analysis.
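For reference, the Box-Cox family is usually written as follows. This is the standard textbook form and is not taken from the paper's output; λ is the exponent reported as lambda:

```latex
y^{(\lambda)} =
\begin{cases}
  \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[6pt]
  \log y, & \lambda = 0,
\end{cases}
\qquad y > 0 .
```

The transformation code in this paper uses the plain power y^λ instead. For λ ≠ 0 the two forms differ only by a linear rescaling, so the quality of the linear fit is the same, but for negative λ the plain power reverses the ordering of the values and therefore the sign of the slope.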
The macro “BOX_COX_LAMBDA” is defined below for the above code, where “PRE_TRANS_DATASET=” is the name of the input dataset with non-zero and non-negative values, and “XVAR=” and “YVAR=” are the names of the X- and Y-variables, respectively.
%macro box_cox_lambda (pre_trans_dataset= ,xvar= ,yvar= );
title "Box-Cox power transformation: Identification of right exponent (Lambda)";
ods graphics on;
proc transreg data= &pre_trans_dataset;
model boxcox(&yvar) = identity(&xvar);
run;
ods graphics off;
proc transreg data = &pre_trans_dataset;
model boxcox(&yvar)=identity(&xvar);
run;
%mend box_cox_lambda;
The macro “BOX_COX_LAMBDA” can be called with the following statement for the dataset “HEALTH_COX”:
%box_cox_lambda (pre_trans_dataset=health_cox, xvar=x ,yvar=y);
A DATA step program is used to transform the Y-variable. Code for the common convenient lambda values (-2, -1, -0.5, 0, 0.5, 1 and 2), the respective Y-transformations (1/Y², 1/Y, 1/sqrt(Y), log(Y), sqrt(Y), Y and Y²), and the respective transformed-Y variable names (neg_2_y, neg_1_y, neg_half_y, zero_y, half_y, one_y and two_y) is incorporated in the following DATA step program.
title "Transformation of Y-variable with convenient lambda";
data health_trans_1;
set health_cox;
neg_2_y = 1/(y**2);
neg_1_y = 1/(y**1);
neg_half_y = 1/(sqrt(y));
zero_y = log(y);
half_y = sqrt(y);
one_y = y**1;
two_y = y**2;
run;
proc print data=health_trans_1;
run;
This code generates the dataset “HEALTH_TRANS_1” with the new transformed Y-variables (Table 9A).
Alternatively, the following PROC SQL code generates the same dataset under a different name.
title "Transformation of Y-values with convenient lambda";
proc sql;
create table health_trans as
select x, y,
1/(y**2) as neg_2_y,
1/(y**1) as neg_1_y,
1/(sqrt(y)) as neg_half_y,
log(y) as zero_y,
sqrt(y) as half_y,
y**1 as one_y,
y**2 as two_y
from health_cox;
quit;
proc print data=health_trans;
run;
PROC SQL generates the table “HEALTH_TRANS” (Table 9B). “neg_1_y” is the transformed Y-variable that corresponds to the convenient lambda of -1. This “neg_1_y” variable is used for further analysis.
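When only one transformation is needed, the convenient lambda can be held in a macro variable and applied directly. This is a sketch rather than the paper's method; the macro variable LAMBDA, the dataset HEALTH_TRANS_ONE and the variable TRANS_Y are hypothetical names.

```sas
/* Convenient lambda read from the PROC TRANSREG output (-1 in this analysis) */
%let lambda = -1;

data health_trans_one;                 /* hypothetical output dataset */
   set health_cox;
   /* lambda = 0 corresponds to the log transformation */
   if &lambda = 0 then trans_y = log(y);
   else trans_y = y**(&lambda);        /* plain power, as in the code above */
run;
```

With LAMBDA set to -1, TRANS_Y holds the same values as “neg_1_y”.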
The datasets obtained with the DATA step program and the PROC SQL code are compared using PROC COMPARE. The two datasets are equal in all respects (Log 2).
title "Comparison of output obtained by DATA step and PROC SQL methods";
proc compare
base=health_trans
compare=health_trans_1
printall;
run;
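The result of PROC COMPARE can also be checked programmatically through the automatic macro variable SYSINFO, which the procedure sets to 0 when it finds no differences. A minimal sketch, in which COMPARE_RC is a hypothetical macro variable name:

```sas
proc compare base=health_trans compare=health_trans_1 noprint;
run;

/* SYSINFO is overwritten by later steps, so capture it immediately */
%let compare_rc = &sysinfo;
%put NOTE: PROC COMPARE return code = &compare_rc (0 means no differences found);
```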
The macro “TRANSFORM_LAMBDA” is defined below for the above PROC SQL code, where “PRE_TRANS_DATASET=” is the name of the input dataset with non-zero and non-negative X- and Y-values; “XVAR=” and “YVAR=” are the names of the X- and Y-variables, respectively; and “TRANS_DATASET=” is the name of the output dataset with the transformed data.
%macro transform_lambda (pre_trans_dataset= ,xvar= ,yvar= ,trans_dataset= );
title "Transformation of &yvar.-values with convenient lambda";
proc sql;
create table &trans_dataset as
select &xvar, &yvar,
1/(&yvar**2) as neg_2_&yvar,
1/(&yvar**1) as neg_1_&yvar,
1/(sqrt(&yvar)) as neg_half_&yvar,
log(&yvar) as zero_&yvar,
sqrt(&yvar) as half_&yvar,
&yvar**1 as one_&yvar,
&yvar**2 as two_&yvar
from &pre_trans_dataset;
quit;
proc print data=&trans_dataset;
run;
%mend transform_lambda;
The macro “TRANSFORM_LAMBDA” can be invoked with the following code for the dataset “HEALTH_COX”:
%transform_lambda (pre_trans_dataset=health_cox, xvar=x, yvar=y,
trans_dataset=health_trans);
Standardization of X-variable
After transformation of the Y-variable, the X-variable is standardized using PROC STDIZE in order to obtain a meaningful Y-intercept. The dataset “HEALTH2” is generated from the table “HEALTH_TRANS” in this procedure. With METHOD=MEAN, PROC STDIZE subtracts the mean of X without rescaling it, so the intercept becomes the predicted response at the mean of X. The OPREFIX= option is used to prefix the original X-variable name with “Unstdized_”, while the standardized X-values are stored under X.
title "Standardized X-variable after Y-transformation";
proc stdize data=health_trans
oprefix=Unstdized_
method=mean
out=health2;
var x;
run;
proc print data=health2;
run;
The generated dataset “HEALTH2” is shown below with the standardized X-variable in the last column as X (Table 10).
The macro “STDIZE_X” is defined below for the above code, where “TRANS_DATASET=” is the name of the input dataset, “TRANS_STDIZE_DATASET=” is the name of the output dataset, and “XVAR=” is the name of the X-variable to be standardized.
%macro stdize_x (trans_dataset= ,trans_stdize_dataset= ,xvar= );
title "Standardized &xvar.-variable after Y-transformation";
proc stdize data=&trans_dataset
oprefix=Unstdized_
method=mean
out=&trans_stdize_dataset;
var &xvar;
run;
proc print data=&trans_stdize_dataset;
run;
%mend stdize_x;
The macro “STDIZE_X” can be called with the following code for the dataset “HEALTH_TRANS”:
%stdize_x (trans_dataset=health_trans, trans_stdize_dataset=health2, xvar=x);
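In keeping with the PROC SQL theme of this paper, the same mean-centering could be sketched in PROC SQL. This assumes that METHOD=MEAN only subtracts the mean; HEALTH2_SQL and CENTERED_X are hypothetical names, and the result has not been compared against the PROC STDIZE output here.

```sas
proc sql;
   create table health2_sql as            /* hypothetical table name */
   select *,
          x - mean(x) as centered_x       /* hypothetical column: X minus its mean */
   from health_trans;
quit;
```

Unlike PROC STDIZE with OPREFIX=, this keeps the original X under its own name and adds the centered values as a new column.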
Regression Analysis of Transformed-Standardized Data
Regression analysis and normality tests are performed on the transformed and standardized dataset “HEALTH2” by calling the previously defined macro “REG_NORMALITY”. The variable X is the standardized X, and “neg_1_y” is the transformed Y.
%reg_normality (dataset=health2, xvar=x, yvar=neg_1_y);
With the transformed data, the lack of fit for the linear model turns out to be non-significant, which indicates that the linear model is acceptable for X and Y (Table 11). Parameter estimates for the intercept and X are significant (Table 12A). Compared with the raw data, the adjusted R² value with the transformed data improves from 0.11 to 0.53 (Table 12B). Other results indicate that the transformed data are normally distributed (Figures 6-8; Table 13). A non-significant p-value for the Kolmogorov-Smirnov normality test further supports this (Table 14). However, the other tests of normality are still significant, which indicates that there is room for further improvement with respect to normality. There is at least one outlier and leverage observation that is influencing the distribution (Figure 9).
Outlier and Influential Observations
The R and INFLUENCE options of the MODEL statement and the RSTUDENTBYLEVERAGE, DFFITS, DFBETAS and COOKSD plot requests are used in PROC REG to generate detailed output on outlier and/or influential observations for the dataset “HEALTH2” (Table 15; Figures 10-13).
ods graphics on;
title "Outlier or Influential observations";
proc reg data= health2 plots (only label)= (rstudentbyleverage dffits dfbetas
cooksd);
model neg_1_y = x/r influence;
run;
ods graphics off;
The highest number of asterisks is seen for observation number 43 (Table 15). This observation turns out to be an outlier and leverage observation (Figure 11). Other results also support that observation number 43 is an outlier and influential observation in the dataset “HEALTH2” (Figures 10-13).
The macro “OUTLIER_OBS” is defined below for the above code, where “INDATA=” is the name of the input dataset, and “XVAR=” and “YVAR=” are the names of the X- and Y-variables to be used in the analysis, respectively.
%macro outlier_obs (indata= ,xvar= ,yvar= );
ods graphics on;
title "Outlier or influential observations: Dataset &indata";
proc reg data= &indata plots (only label)= (rstudentbyleverage dffits
dfbetas cooksd);
model &yvar = &xvar/r influence;
run;
ods graphics off;
%mend outlier_obs;
The macro “OUTLIER_OBS” can be invoked with the following statement for the dataset “HEALTH2”:
%outlier_obs (indata=health2, xvar=x, yvar=neg_1_y);
Slicing One Outlier Observation
The dataset “SLICED”, which holds outlier and leverage observation number 43, is created from the dataset “HEALTH2” using the following DATA step code. The observation numbers supplied by SAS are used to identify the outlier(s) and write them to a separate dataset. Alternatively, a WHERE statement can be used in PROC REG to omit observations while performing the regression analysis.
title "Dataset for outlier observation(s): sliced";
data sliced;
do slice=43;
set health2 point=slice;
output;
end;
stop;
run;
proc print data=sliced;
run;
The dataset “SLICED” with one outlier observation is shown below (Table 16):
The macro “SLICE_OBS” is defined below for the above code, where “INDATA=” is the name of the input dataset, “OB=” is the outlier observation number, and “SLICED_DATA=” is the name of the output dataset that stores only the outlier observation.
%macro slice_obs (indata= ,ob=0 ,sliced_data= );
title "Dataset for outlier observation(s): &sliced_data";
data &sliced_data;
do slice=&ob;
set &indata point=slice;
output;
end;
stop;
run;
proc print data=&sliced_data;
run;
%mend slice_obs;
The macro “SLICE_OBS” can be called with the following statement for outlier observation number 43 of the dataset “HEALTH2”:
%slice_obs (indata=health2, ob=43, sliced_data=sliced);
When no observation number is provided, use OB=0 to produce a dataset with missing values in place of an outlier observation. Further analysis is not affected by these missing values.
%slice_obs (indata=health2, ob=0, sliced_data=sliced);
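The WHERE alternative mentioned earlier in this section could be sketched as follows. Because a WHERE statement cannot reference the automatic variable _N_, the observation number is first stored in a variable; HEALTH2_ID and OBS_ID are hypothetical names.

```sas
/* Store the observation number in a regular variable */
data health2_id;                 /* hypothetical dataset name */
   set health2;
   obs_id = _n_;                 /* hypothetical variable name */
run;

/* Fit the model without observation 43, leaving HEALTH2 itself untouched */
proc reg data=health2_id;
   where obs_id ne 43;
   model neg_1_y = x / lackfit;
run;
quit;
```

This avoids creating intermediate datasets, at the cost of not keeping a separate record of the removed observations.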
Dataset without One Outlier Observation
The following DATA step program is used to generate the dataset “HEALTH3_1” with all the observations of the dataset “HEALTH2” except the one that matches the outlier observation in the dataset “SLICED” (Table 17A). Here, the data have to be sorted before merging. Note that the output dataset “HEALTH3_1” contains only 50 observations. The total real and CPU times required for the PROC SORT and DATA steps are 0.07 and 0.06 seconds, respectively (Log 3A).
title "Sorting datasets before merging";
proc sort data=health2;
by unstdized_x y;
run;
title "Dataset without outlier observation(s)";
Data health3_1;
merge health2 (in= inhealth)
sliced (in=insliced);
by unstdized_x y;
if inhealth ^= insliced;
run;
proc print data=health3_1;
run;
Alternatively, the following PROC SQL code can be used to produce the same output in the form of the table “HEALTH3” (Table 17B). Unlike the DATA step program, PROC SQL can combine the datasets without sorting them first. The real and CPU times required for PROC SQL are 0.01 and 0.03 seconds, respectively (Log 3B). In this example, the PROC SQL code is shorter and runs faster than the DATA step program.
title "Dataset without outlier observation(s)";
proc sql;
create table health3 as
select* from health2
except all
select* from sliced;
quit;
proc print data= health3;
run;
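The ALL keyword matters when the data contain duplicate rows, which can happen with replicated measurements. The following sketch uses two made-up tables (A and B are hypothetical names) to contrast EXCEPT ALL with plain EXCEPT:

```sas
/* Hypothetical tables: A holds the value 1 twice, B holds it once */
data a;
   input v;
   datalines;
1
1
2
;
run;

data b;
   input v;
   datalines;
1
;
run;

proc sql;
   title "EXCEPT ALL: removes one matching row per row in B";
   select * from a except all select * from b;   /* rows: 1, 2 */

   title "EXCEPT: removes every matching row and all duplicates";
   select * from a except select * from b;       /* rows: 2 */
quit;
title;
```

With EXCEPT ALL, removing one outlier therefore leaves any identical replicate rows in place.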
The outputs of the DATA step program and PROC SQL are compared and verified using PROC COMPARE. All the values in the datasets “HEALTH3_1” and “HEALTH3” are exactly equal (Log 3C).
title "Comparison of datasets: Data step program vs PROC SQL";
proc compare
base=health3
compare=health3_1
printall;
run;
The macro “NO_OUTLIER_DATA” is defined below for the above PROC SQL code, where “INDATA=” is the name of the dataset with all the observations, “SLICED_DATA=” is the name of the dataset with only the outlier observations, and “OUTDATA=” is the name of the output dataset with all the observations except the outliers.
%macro no_outlier_data (indata= ,sliced_data= ,outdata=);
title "&outdata.: Dataset without outlier observation(s)";
proc sql;
create table &outdata as
select* from &indata
except all
select* from &sliced_data;
quit;
proc print data= &outdata;
run;
%mend no_outlier_data;
The macro “NO_OUTLIER_DATA” can be invoked with the following statement for the datasets “HEALTH2” and “SLICED”:
%no_outlier_data (indata=health2, sliced_data=sliced, outdata=health3);
Regression Analysis without One Outlier Observation
Regression analysis and normality tests are performed on the dataset “HEALTH3” by invoking the previously defined macro “REG_NORMALITY”. The dataset “HEALTH3” excludes one outlier observation. X is the standardized X-variable, and “neg_1_y” is the transformed Y-variable.
%reg_normality (dataset=health3, xvar=x, yvar=neg_1_y);
Since the lack of fit is non-significant, the linear model for X and Y can be accepted for prediction (Table 18). With one outlier observation removed, the adjusted R² value increases from 0.53 in the previous regression analysis to 0.61 (Table 19B). Parameter estimates also change (Table 19A). Results suggest that the data are normal (Figures 14 and 15; Table 20). Further improvement may be possible because the Shapiro-Wilk test, one of the four tests of normality, is still significant. Care should be taken to avoid eliminating too many observations while improving the normality of the data.
Regression Analysis without Second Outlier Observation
The previously defined macros come in handy for this task, and no more hard coding is required. The following macros are invoked to complete it.
%outlier_obs (indata=health3, xvar=x, yvar=neg_1_y);
%slice_obs (indata=health3, ob=10,sliced_data=sliced2);
%no_outlier_data (indata=health3, sliced_data=sliced2, outdata=health4);
%reg_normality (dataset=health4, xvar=x, yvar=neg_1_y);
Outlier and influential observations are identified by calling the macro “OUTLIER_OBS”. Observation number 10, which has the highest number of asterisks in the dataset “HEALTH3”, is identified as an outlier (Table 21; Figure 16).
The macro “SLICE_OBS” stores this outlier observation (obs # 10) in a dataset named “SLICED2” (Table 22).
The table “HEALTH4” without the second outlier observation is generated by calling the macro “NO_OUTLIER_DATA” (Table 23). Note that the total number of observations in the dataset “HEALTH4” is reduced to 49 (Table 23; Log 4).
Regression analysis and normality tests are run on the dataset “HEALTH4” by calling the macro “REG_NORMALITY”. As in the earlier results, the parameter estimates and the adjusted R² (0.70) improve further. The data are closer to normal than before (Tables 24, 25A and 25B; Figures 17 and 18). However, the Shapiro-Wilk normality test is still significant (Table 26).
Regression Analysis without Third Outlier Observation
For the sake of exploration, further analysis is carried out to eliminate a third outlier observation from the data. Interestingly, invoking the macro “OUTLIER_OBS” on the dataset “HEALTH4” produces two competing outlier observations (obs # 47 and obs # 28) (Table 27; Figures 19-22). For this reason, further analysis with the other macros (SLICE_OBS, NO_OUTLIER_DATA, and REG_NORMALITY) is carried out separately for observation numbers 47 and 28.
%outlier_obs (indata=health4, xvar=x, yvar=neg_1_y);
Analysis without outlier observation number 47:
%slice_obs (indata=health4, ob=47,sliced_data=sliced3);
%no_outlier_data (indata=health4, sliced_data=sliced3, outdata=health5);
%reg_normality (dataset=health5, xvar=x, yvar=neg_1_y);
Eliminating observation number 47 does not improve the results of the normality tests (Table 28). The Shapiro-Wilk test is still significant.
Analysis without outlier observation number 28:
%slice_obs (indata=health4, ob=28,sliced_data=sliced3);
%no_outlier_data (indata=health4, sliced_data=sliced3, outdata=health5);
%reg_normality (dataset=health5, xvar=x, yvar=neg_1_y);
The dataset “SLICED3” for outlier observation number 28 is generated by invoking the macro “SLICE_OBS” (Log 5). Note that the dataset “HEALTH5” contains only 48 observations (Log 6). After eliminating observation number 28, all four tests of normality are non-significant (Table 29). Other results support this (Figures 23 and 24). The residual-fit spread plot indicates how much of the variation in the model is accounted for by the X-variable (Figure 25). The analysis of variance, parameter estimates and adjusted R² are presented in Tables 30, 31A and 31B, respectively. As in the previous analyses, the lack of fit for the model is non-significant (Table 30). The adjusted R² value improves further to 0.73.
Linear Regression Models
From the above analysis, linear regression models for the raw data, the normalized data, and the normalized data without 1 to 3 outlier observations are given below. For the normalized data, Y in the model is the transformed variable “neg_1_y” (1/Y) and X is the standardized X. After normalization, the data start to exhibit the true relationship between X and Y. One should consider several other factors before proceeding to eliminate outlier observations.
Raw data (non-normal): Y = 0.124 + 0.933X (Adjusted R²: 0.11)
Normalized data: Y = 1.240 - 0.514X (Adjusted R²: 0.53)
Normalized data without 1 outlier observation: Y = 1.213 - 0.653X (Adjusted R²: 0.61)
Normalized data without 2 outlier observations: Y = 1.234 - 0.689X (Adjusted R²: 0.70)
Normalized data without 3 outlier observations: Y = 1.218 - 0.690X (Adjusted R²: 0.73)
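Because the normalized models are fitted to the reciprocal of Y (neg_1_y) against the X-values centered by PROC STDIZE with METHOD=MEAN, a prediction on the original scale needs a back-transformation. The following is a sketch for the final model, assuming the coefficients listed above and writing \bar{X} for the mean of the original X-values, whose value is not reported in this section:

```latex
\widehat{\left(\tfrac{1}{Y}\right)} = 1.218 - 0.690\,(X - \bar{X})
\quad\Longrightarrow\quad
\hat{Y} = \frac{1}{1.218 - 0.690\,(X - \bar{X})}
```

The back-transformed value is meaningful only while the denominator stays positive.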
Scatter Plots after Normalization
Optionally, the relationship between X and Y can be visualized by calling the macro “SCATTER_CORR” again for the transformed dataset “HEALTH2” and the final dataset “HEALTH5”.
Analysis of dataset “HEALTH2”:
%scatter_corr (dataset=health2, xvar=x, yvar=neg_1_y);
Analysis of dataset “HEALTH5”:
%scatter_corr (dataset=health5, xvar=x, yvar=neg_1_y);
The Pearson correlation coefficient of the non-normal raw data is 0.35 (Table 2). Unlike the raw data, both datasets, “HEALTH2” and “HEALTH5”, are normal and exhibit a similarly strong negative relationship between X and Y, with Pearson correlation coefficients between -0.70 and -0.90 (Figures 26 and 27; Tables 32 and 33). The sign changes because the transformed response is the reciprocal, 1/Y.
Master Macros
Three master macros are created for the whole analysis. When invoked, these master macros call the previously defined macros.
Master macro 1: IMP_SCATT_CORR_REG_NORMAL
This macro performs the initial set of operations (data import from an Excel file, scatter plot, correlation, regression analysis and normality tests). It calls three macros (EXCEL_IMPORT, SCATTER_CORR and REG_NORMALITY). All the keyword parameters are described above for each macro. The code for the macro EXCEL_IMPORT may be modified according to the version of the Excel file.
%macro imp_scatt_corr_reg_normal (excel_file= ,excel_sheet= , dataset= ,xvar= ,
yvar=);
%excel_import (excel_file=&excel_file, excel_sheet=&excel_sheet,
dataset=&dataset);
%scatter_corr (dataset=&dataset, xvar=&xvar, yvar=&yvar);
%reg_normality (dataset=&dataset, xvar=&xvar, yvar=&yvar);
%mend imp_scatt_corr_reg_normal;
Master macro 2: TRANSFORMATION_BOX_COX
This macro transforms the data when they are not normal. The user still has to read the convenient lambda from the output and choose the matching transformed variable for the later steps. It calls four macros (TRANSFORM_ZERO_NEG, BOX_COX_LAMBDA, TRANSFORM_LAMBDA and STDIZE_X). All the keyword parameters are described above for each macro.
%macro transformation_box_cox (dataset= , pre_trans_dataset= , trans_dataset= ,
trans_stdize_dataset= , xvar= , yvar=);
%transform_zero_neg (dataset=&dataset, xvar=&xvar, yvar=&yvar,
pre_trans_dataset=&pre_trans_dataset);
%box_cox_lambda (pre_trans_dataset=&pre_trans_dataset, xvar=&xvar,
yvar=&yvar);
%transform_lambda (pre_trans_dataset=&pre_trans_dataset, xvar=&xvar,
yvar=&yvar, trans_dataset=&trans_dataset);
%stdize_x (trans_dataset=&trans_dataset,
trans_stdize_dataset=&trans_stdize_dataset, xvar=&xvar);
%mend transformation_box_cox;
Master macro 3: REGRESSION_WOUT_OUTLIERS
This macro identifies and eliminates outlier observations in the data, producing outlier-free data for the next regression analysis. It calls four macros (REG_NORMALITY, OUTLIER_OBS, SLICE_OBS and NO_OUTLIER_DATA). All the keyword parameters are described above for each macro.
%macro regression_wout_outliers (dataset= , indata= , ob= ,sliced_data= ,
outdata= ,xvar= , yvar=);
%reg_normality (dataset=&dataset, xvar=&xvar, yvar=&yvar);
%outlier_obs (indata=&indata, xvar=&xvar, yvar=&yvar);
%slice_obs (indata=&indata, ob=&ob, sliced_data=&sliced_data);
%no_outlier_data (indata=&indata, sliced_data=&sliced_data,
outdata=&outdata);
%mend regression_wout_outliers;
Now the whole analysis can be performed on the same (or a similar) type of data, quickly and without any further hard coding, by calling the master macros in the following manner. It is important to specify the location of the stored macros before calling them.
options mautosource sasautos="C:\Users\Perla\Documents\My SAS Files\9.4\statmacros";
%imp_scatt_corr_reg_normal (excel_file=data1,excel_sheet=XY_Data,
dataset=health,xvar=x, yvar=y);
%transformation_box_cox (dataset=health, pre_trans_dataset=health_cox,
trans_dataset=health_trans, trans_stdize_dataset=health2,xvar=x, yvar=y);
For the third master macro (REGRESSION_WOUT_OUTLIERS), start with a dataset containing the transformed Y- and standardized X-variables. Run this macro first with OB=0, then with OB= set to the observation number to be deleted. In the first run, identify the outlier observation number. In the second run, slice this observation from the data. Repeat these two steps, with caution, until the desired results are achieved.
%regression_wout_outliers (dataset=health2, indata=health2, ob=0,
sliced_data=sliced, outdata=health3, xvar=x , yvar=neg_1_y);
**From this run, it is clear that ob=43 is an outlier;
**For improvement, slice ob=43 from health2;
%regression_wout_outliers (dataset=health2, indata=health2, ob=43,
sliced_data=sliced, outdata=health3, xvar=x , yvar=neg_1_y);
%regression_wout_outliers (dataset=health3, indata=health3, ob=0,
sliced_data=sliced1, outdata=health4, xvar=x , yvar=neg_1_y);
**From this run, it is clear that ob=10 is an outlier;
**For improvement, slice ob=10 from health3;
%regression_wout_outliers (dataset=health3, indata=health3, ob=10,
sliced_data=sliced1, outdata=health4, xvar=x , yvar=neg_1_y);
%regression_wout_outliers (dataset=health4, indata=health4, ob=0,
sliced_data=sliced2, outdata=health5, xvar=x , yvar=neg_1_y);
**From this run, it is clear that ob=28 is an outlier;
**For improvement, slice ob=28 from health4;
%regression_wout_outliers (dataset=health4, indata=health4, ob=28,
sliced_data=sliced2, outdata=health5, xvar=x , yvar=neg_1_y);
**Once the results are satisfactory, run again and use the model parameters for final use;
%regression_wout_outliers (dataset=health5, indata=health5, ob=0,
sliced_data=sliced3, outdata=health6, xvar=x , yvar=neg_1_y);
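One possible way to choose the observation number passed to OB= in the runs above is to flag observations with large studentized residuals. The sketch below is only an illustration and is not necessarily the criterion used by OUTLIER_OBS. The cutoff of 2 and the dataset names DIAG and FLAGGED are assumptions, and the observation number is assumed to be the row position in HEALTH2.

```sas
/* Fit the model and save per-observation diagnostics.        */
/* DIAG is a hypothetical dataset name used for illustration. */
proc reg data=health2 noprint;
   model neg_1_y = x;
   output out=diag rstudent=rstud cookd=cooksd;
run;
quit;

/* Record the row number and keep rows beyond the assumed cutoff. */
data flagged;
   set diag;
   obs_no = _n_;
   if abs(rstud) > 2;
run;

/* List the candidate outliers for manual review. */
proc print data=flagged noobs;
   var obs_no x neg_1_y rstud cooksd;
run;
```

Any flagged observation should still be reviewed before it is sliced from the data, in line with the caution advised above.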
Optionally, after the above analysis, the relationship between X and Y can be visualized by calling the macro
"SCATTER_CORR" again for the transformed datasets (HEALTH2 and HEALTH5).
%scatter_corr (dataset=health2, xvar=x, yvar=neg_1_y);
%scatter_corr (dataset=health5, xvar=x, yvar=neg_1_y);
Conclusion
In this paper, a simple linear regression model is developed for the X- and Y-variables after normalizing replicated
raw data in a systematic manner. Various statistical methods, SAS DATA step programs and PROC SQL are employed to
achieve this goal. PROC SQL is used effectively in place of several DATA step programs. By bringing in the SAS macro
language, the number of SAS statements required to perform each repeatable task is reduced to a bare minimum.
Furthermore, the defined macros can be used to analyze similar data within a short period of time and with little
further hard coding.
References
Box, G. E. P. and Cox, D. R. 1964. An analysis of transformations (with discussion). Journal of the Royal Statistical
Society, Series B 26 (2): 211–252.
Buthmann, A. Making Data Normal Using Box-Cox Power Transformation. iSixSigma. Available at
http://www.isixsigma.com/tools-templates/normality/making-data-normal-using-box-cox-power-transformation/
Carpenter, Art. 2004. Carpenter's Complete Guide to the SAS® Macro Language, Second Edition. SAS Institute Inc.,
Cary, NC, USA.
Lafler, Kirk Paul. 2013. PROC SQL: Beyond the Basics Using SAS®, Second Edition. SAS Institute Inc., Cary, NC, USA.
SAS® 9.4 Product Documentation. SAS Institute Inc., Cary, NC, USA. Available at
http://support.sas.com/documentation/94/index.html
SAS/STAT® 9.3 User's Guide. SAS Institute Inc., Cary, NC, USA. Available at
http://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#intro_toc.htm
SAS® 9.2 Macro Language: Reference. SAS Institute Inc., Cary, NC, USA. Available at
http://support.sas.com/documentation/cdl/en/mcrolref/61885/HTML/default/viewer.htm#titlepage.htm
SAS® 9.3 SQL Procedure User's Guide. SAS Institute Inc., Cary, NC, USA. Available at
http://support.sas.com/documentation/cdl/en/sqlproc/63043/HTML/default/viewer.htm#titlepage.htm
Acknowledgments
I would like to thank the organizers for giving me the opportunity to present this paper at the Ohio SAS® Users
Conference, hosted by CinSUG, CoSUG and CleveSUG on June 1, 2015 at the Kingsgate Marriott Conference Center at
the University of Cincinnati, Cincinnati, Ohio.
Trademark Citations
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Author Biography
Venu Perla, Ph.D., is a biomedical researcher with about 14 years of research and teaching
experience in an academic environment. He is currently working in West Virginia. He has served at
Purdue University, Oregon Health & Science University, Colorado State University, Kerala
Agricultural University (India) and Mangalayatan University (India) in different capacities. Dr.
Perla has published 13 peer-reviewed research papers and 2 book chapters, obtained 1
international patent (on an orthopaedic implant device), given 7 talks and presented 18 posters at
national and international scientific conferences in his professional career. He was trained in
clinical trials and clinical data management. He was also trained in advanced SAS® programming
and clinical biostatistics at the University of California, San Diego. Currently, he is actively employing SAS®
programming techniques in his research data analysis.
Contact Information
Phone (Cell): (304) 545-5705
Email: venuperla@yahoo.com
LinkedIn: https://www.linkedin.com/pub/venu-perla/2a/700/468