Ohio SAS Users Conference Presentation on Linear Regression

Ohio SAS® Users Conference
June 1, 2015, Cincinnati, Ohio, USA
How PROC SQL and SAS® Macro Programming Made My Statistical Analysis Easy? A Case Study on Linear Regression
Venu Perla Ph.D., Independent SAS Programmer, Cross Lanes, WV 25313
Abstract
Life scientists collect similar types of data on a daily basis. Statistical analysis of these data is often performed using SAS programming techniques, and writing a new program for each dataset is time consuming. The objective of this paper is to show how SAS programs are created for the systematic analysis of raw data to develop a linear regression model for prediction; then to show how PROC SQL can replace several DATA steps in the code; and finally to show how SAS macros are built on these programs and used for routine analysis of similar data in a short period of time without further hard coding.
Introduction
This paper uses raw data on two interrelated plant metabolites (X and Y) to generate a linear regression model. There are 51 observations in this replicated dataset. The analysis is carried out with SAS® 9.4 software on the Windows operating system. The code is also tested with SAS® Studio 3.1. The HTML results generated by the code are presented here with the STYLE option ‘HTMLBlue’.
Importing Data from Excel
The data used in this paper are imported from a sheet (XY_Data) of a Microsoft® Office Excel 97-2003 file (data1.xls) (see Appendix). PROC IMPORT is used to import ‘XY_Data’ as the dataset ‘HEALTH’ (Table 1). A macro variable, ‘PATH’, is created for the Excel file path. In the FILE= option, a period (‘.’) is placed at the end of this macro variable reference to mark where the variable name ends, so that the macro facility does not misinterpret the text that follows. The file extension and the DBMS= option in the code may be modified according to the Excel version used.
%let path= C:\Users\Perla\Desktop\;
title "Importing data from excel";
proc import file="&path.data1.xls"
out=health replace
dbms=xls;
sheet=XY_Data;
getnames=yes;
run;
title "Checking imported data";
proc print data=health;
run;

Macro ‘EXCEL_IMPORT’ is defined below for the above code, for importing Excel files, where ‘EXCEL_FILE=’ is the name of the Excel file to be used; ‘EXCEL_SHEET=’ is the name of the Excel sheet to be imported; and ‘DATASET=’ is the name of the output dataset.
%macro excel_import (excel_file= , excel_sheet= , dataset= );
title "Importing data from excel";
proc import file="&path.&excel_file..xls"
out=&dataset replace
dbms=xls;
sheet=&excel_sheet;
getnames=yes;
run;
title "Dataset from imported excel data";
proc print data=&dataset;
run;
%mend excel_import;
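The role of the single and double periods in the FILE= value can be checked in the log before running the import. The following is a minimal sketch, assuming an illustrative folder (C:\mydata\) and a hypothetical macro variable FNAME; neither is part of the original analysis.

```sas
/* Sketch: how the macro facility resolves periods after a macro variable. */
/* The folder and the FNAME variable are hypothetical.                     */
%let path  = C:\mydata\;
%let fname = data1;

/* One period ends the macro variable name and is consumed. */
%put &path.data1.xls;      /* C:\mydata\data1.xls */

/* Two periods: the first is the delimiter, the second stays as text. */
%put &path.&fname..xls;    /* C:\mydata\data1.xls */

/* With a single period the dot before the extension is lost. */
%put &path.&fname.xls;     /* C:\mydata\data1xls */
```

This is why the macro writes the file name as &excel_file..xls, whereas the hard-coded file name needs only one period after &path.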
The defined macros are saved in a single folder (STATMACROS) for future use.
C:\Users\Perla\Documents\My SAS Files\9.4\statmacros
To import XY_Data, macro EXCEL_IMPORT can be called with the following code after the location of the stored macros is specified in a global OPTIONS statement. MPRINT, MLOGIC and SYMBOLGEN are global options for debugging the code.
options mprint mlogic symbolgen;
options mautosource sasautos=
"C:\Users\Perla\Documents\My SAS Files\9.4\statmacros";
%excel_import (excel_file=data1, excel_sheet=XY_Data, dataset=health);
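Once the macros run cleanly, the debugging options can be switched off to keep the log short. The sketch below also prints the current autocall search path; it assumes each macro is stored in the autocall folder in a file named after the macro (for example, excel_import.sas), which is what the autocall facility expects.

```sas
/* Sketch: restore quieter logging once the macros are debugged. */
options nomprint nomlogic nosymbolgen;

/* Show the current autocall search path in the log. */
%put SASAUTOS = %sysfunc(getoption(sasautos));
```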
Preliminary Analysis of Data
The relationship between the X- and Y-variables can be visualized with PROC SGPLOT and quantified with PROC CORR.

ods graphics on;
title "Scatter plot of X and Y";
proc sgplot data= health;
scatter x=x y=y;
run;
title "Correlation between X and Y";
proc corr data = health;
var x y;
run;
ods graphics off;
The scatter plot of X and Y shows no clear relationship between the two variables (Figure 1). The Pearson correlation coefficient indicates a weak correlation between the X- and Y-variables (Table 2).
Macro ‘SCATTER_CORR’ is defined below for the above code, where ‘DATASET=’ is the name of the dataset to be used for analysis; and ‘XVAR=’ and ‘YVAR=’ are the names of the X- and Y-variables, respectively.
%macro scatter_corr (dataset= , xvar= , yvar= );
ods graphics on;
title "Scatter plot of &xvar and &yvar";
proc sgplot data= &dataset;
scatter x=&xvar y=&yvar;
run;

title "Correlation between &xvar and &yvar";
proc corr data = &dataset;
var &xvar &yvar;
run;
ods graphics off;
%mend scatter_corr;
Macro ‘SCATTER_CORR’ can be invoked with the following statement for the dataset ‘HEALTH’:
%scatter_corr (dataset=health, xvar=x, yvar=y);
There is an indication of a weak correlation between X and Y (Pearson correlation coefficient: 0.35). Further analysis is carried out on the raw data using PROC REG and PROC UNIVARIATE. The LACKFIT option of the MODEL statement in PROC REG tests whether the linear model is an adequate fit for these replicated data. Residual analysis and normality tests are carried out using PROC UNIVARIATE with the NORMAL option.
ODS graphics on;
title "Regression analysis";
proc reg data = health plots(only)=diagnostics (unpack);
model y = x/lackfit;
output out =mdlres r=resid;
run;
ODS graphics off;
proc univariate data= mdlres normal;
var resid;
run;
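PROC UNIVARIATE prints several tables, of which only the moments (including skewness) and the tests of normality are discussed in this paper. As a minimal sketch, assuming the residual dataset MDLRES created above, ODS SELECT can restrict the output to those two tables.

```sas
/* Sketch: print only the moments and the normality tests for the residuals. */
ods select Moments TestsForNormality;
proc univariate data=mdlres normal;
   var resid;
run;
```

The ODS SELECT statement applies only to the next procedure step, so later steps print their full output again.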
The analysis of variance indicates that the LACK OF FIT for the linear model is significant (Table 3). This suggests that further in-depth analysis has to be carried out on the raw data before rejecting the model.
Parameter estimates and the adjusted R² value for the raw data are provided in Tables 4A and 4B, respectively. The adjusted R² value is low (0.11).

The distribution of residuals for Y is not normal for the raw data (Figure 2).
Furthermore, the observed-by-predicted plot for Y shows that the observations are crowded in the lower left corner of the plot (Figure 3).

The Q-Q plot of residuals for Y further confirms that the raw data are not normal (Figure 4).
The raw data are skewed (Table 5), and the significant p-values for all four tests of normality confirm the non-normal distribution of the data (Table 6).

Macro ‘REG_NORMALITY’ is defined below for the regression analysis and normality tests described above, where ‘DATASET=’ is the name of the dataset to be used for analysis; and ‘XVAR=’ and ‘YVAR=’ are the names of the X- and Y-variables, respectively.
%macro reg_normality (dataset= ,xvar= ,yvar= );
ODS graphics on;
title "Regression analysis: Dataset &dataset";
proc reg data = &dataset plots(only)=diagnostics (unpack);
model &yvar = &xvar/lackfit;
output out =mdlres r=resid;
run;
proc univariate data= mdlres normal;
var resid;
run;
ODS graphics off;
%mend reg_normality;
Macro ‘REG_NORMALITY’ can be called with the following statement for the dataset ‘HEALTH’:
%reg_normality (dataset=health, xvar=x, yvar=y);
Preliminary Data Transformation
A Box-Cox power transformation can be adopted to normalize the raw data. Because the transformation requires strictly positive values, the data must first be converted to non-zero and non-negative values. The following code shifts the X- and Y-variables into positive values only when ‘0’ or negative values are encountered in the data.
PROC SQL is used to transform the X- and Y-variable data into non-zero and non-negative data. Table HEALTH_COX is created from dataset HEALTH in this procedure. Here PROC SQL reproduces the original data, as there are no zeros or negative values (Table 7; Log 1).

title "Transforming X and Y values into non-zero and non-negative values";
proc sql;
create table health_cox as
select case
when min(x) <=0 then (-(min(x))+x+1)
else x
end as X,
case
when min(y) <=0 then (-(min(y))+y+1)
else y
end as Y
from health;
quit;
proc print data=health_cox;
run;
Macro ‘TRANSFORM_ZERO_NEG’ is defined below for the above PROC SQL code, where ‘DATASET=’ is the name of the input dataset to be used for transforming X- and Y-values; ‘XVAR=’ and ‘YVAR=’ are the names of the X- and Y-variables to be transformed, respectively; and ‘PRE_TRANS_DATASET=’ is the name of the output dataset to be created with the transformed X- and Y-variables.
%macro transform_zero_neg (dataset= ,xvar= ,yvar= ,pre_trans_dataset=);
title "Transforming &xvar and &yvar values into non-zero and non-negative values";
proc sql;
create table &pre_trans_dataset as
select case
when min(&xvar) <=0 then (-(min(&xvar))+&xvar+1)
else &xvar
end as &xvar,
case
when min(&yvar) <=0 then (-(min(&yvar))+&yvar+1)
else &yvar
end as &yvar
from &dataset;
quit;
proc print data=&pre_trans_dataset;
run;

%mend transform_zero_neg;
Macro ‘TRANSFORM_ZERO_NEG’ can be invoked with the following statement for the dataset ‘HEALTH’:
%transform_zero_neg
(dataset=health,xvar=x,yvar=y,pre_trans_dataset=health_cox);
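Because the HEALTH data contain no zeros or negative values, the shift never takes effect in this paper. The following sketch applies the same CASE logic to a small hypothetical dataset (TOY) whose X-variable contains a zero and a negative value, to show what the shift does when it is triggered.

```sas
/* Sketch: toy data (hypothetical) with a zero and a negative X value. */
data toy;
   input x y;
   datalines;
-2 1.5
0 2.0
3 0.5
;
run;

proc sql;
   create table toy_shifted as
   select case
             when min(x) <= 0 then x - min(x) + 1   /* shift so minimum becomes 1 */
             else x
          end as x,
          case
             when min(y) <= 0 then y - min(y) + 1
             else y
          end as y
   from toy;
quit;

proc print data=toy_shifted;
run;
```

Here min(x) is -2, so the X-values -2, 0 and 3 become 1, 3 and 6, while Y is left unchanged because its minimum (0.5) is already positive. PROC SQL writes a note about remerging summary statistics to the log, which is expected for this kind of query.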
Box-Cox Power Transformation
Box-Cox power transformation on non-zero and non-negative data is performed using PROC TRANSREG with ODS
GRAPHICS on.
title "Box-Cox power transformation: Identification of right exponent (Lambda)";
ods graphics on;
proc transreg data= health_cox;
model boxcox(y) = identity(x);
run;
ods graphics off;
The above code generates the Box-Cox analysis for Y (Figure 6). The selected lambda (-0.75, at the 95% confidence level) is the exponent to be used to transform the data toward a normal shape.
To obtain the convenient lambda value, the above SAS code is executed without the ODS GRAPHICS statement.

proc transreg data = health_cox;
model boxcox(y)=identity(x);
run;
This code generates the best lambda, the lambda values within the 95% confidence interval, and the convenient lambda (Table 8). The convenient lambda is used for transforming the Y-variable in this analysis.
Macro ‘BOX_COX_LAMBDA’ is defined below for the above code, where ‘PRE_TRANS_DATASET=’ is the name of the input dataset with non-zero and non-negative values; and ‘XVAR=’ and ‘YVAR=’ are the names of the X- and Y-variables, respectively.
%macro box_cox_lambda (pre_trans_dataset= ,xvar= ,yvar= );
title "Box-Cox power transformation: Identification of right exponent (Lambda)";
ods graphics on;
proc transreg data= &pre_trans_dataset;
model boxcox(&yvar) = identity(&xvar);
run;
ods graphics off;
proc transreg data = &pre_trans_dataset;
model boxcox(&yvar)=identity(&xvar);
run;
%mend box_cox_lambda;
Macro ‘BOX_COX_LAMBDA’ can be called with the following statement for the dataset ‘HEALTH_COX’:
%box_cox_lambda (pre_trans_dataset=health_cox, xvar=x ,yvar=y);
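The lambda values searched by PROC TRANSREG and the confidence level can also be stated explicitly in the BOXCOX transformation. The following is a minimal sketch; the grid (-2 to 2 by 0.25) and ALPHA=0.05 are illustrative choices, not settings taken from the analysis above.

```sas
/* Sketch: Box-Cox search with an explicit lambda grid.                 */
/* CONVENIENT asks TRANSREG to use a convenient lambda (such as -1, 0,  */
/* 0.5 or 1) when it lies inside the confidence interval.               */
proc transreg data=health_cox;
   model boxcox(y / lambda=-2 to 2 by 0.25 convenient alpha=0.05)
         = identity(x);
run;
```

A narrower grid with a smaller step may be worth trying when the confidence interval for lambda is wide.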
A DATA step program is used to transform the Y-variable. Code for the common convenient lambda values (-2, -1, -0.5, 0, 0.5, 1 and 2); the respective Y-transformations (1/Y², 1/Y, 1/sqrt(Y), log(Y), sqrt(Y), Y and Y²); and the respective transformed Y-variable names (neg_2_y, neg_1_y, neg_half_y, zero_y, half_y, one_y, and two_y) are incorporated in the following DATA step program.
title "Transformation of Y-variable with convenient lambda";

data health_trans_1;
set health_cox;
neg_2_y = 1/(y**2);
neg_1_y = 1/(y**1);
neg_half_y = 1/(sqrt(y));
zero_y = log(y);
half_y = sqrt(y);
one_y = y**1;
two_y = y**2;
run;
proc print data=health_trans_1;
run;
This code generates dataset ‘HEALTH_TRANS_1’ with the new transformed Y-variables (Table 9A).
Alternatively, the following PROC SQL code generates the same dataset under a different name.
title "Transformation of Y-values with convenient lambda";
proc sql;
create table health_trans as
select x, y,
1/(y**2) as neg_2_y,
1/(y**1) as neg_1_y,
1/(sqrt(y)) as neg_half_y,
log(y) as zero_y,
sqrt(y) as half_y,
y**1 as one_y,
y**2 as two_y
from health_cox;
quit;
proc print data=health_trans;
run;
PROC SQL generates the ‘HEALTH_TRANS’ table (Table 9B). ‘neg_1_y’ is the transformed Y-variable corresponding to the convenient lambda of -1. This ‘neg_1_y’ variable is used for further analysis.

The datasets obtained with the DATA step program and the PROC SQL code are compared with PROC COMPARE. Both datasets are equal in all respects (Log 2).
title "Comparison of output obtained by DATA step and PROC SQL methods";
proc compare
base=health_trans
compare=health_trans_1
printall;
run;
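The result of PROC COMPARE can also be checked programmatically through the automatic macro variable SYSINFO, which PROC COMPARE sets to 0 when no differences are found. The following is a minimal sketch; the macro variable COMPARE_RC is a hypothetical name.

```sas
/* Sketch: check the PROC COMPARE return code instead of reading the listing. */
proc compare base=health_trans compare=health_trans_1 noprint;
run;
%let compare_rc = &sysinfo;   /* capture immediately after the step */

%put NOTE: PROC COMPARE return code = &compare_rc (0 means no differences found);
```

SYSINFO is overwritten by later steps, so it has to be copied right after the PROC COMPARE step ends.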
Macro ‘TRANSFORM_LAMBDA’ is defined below for the above PROC SQL code, where ‘PRE_TRANS_DATASET=’ is the name of the input dataset with non-zero and non-negative X- and Y-values; ‘XVAR=’ and ‘YVAR=’ are the names of the X- and Y-variables, respectively; and ‘TRANS_DATASET=’ is the name of the output dataset with the transformed data.
%macro transform_lambda (pre_trans_dataset= ,xvar= ,yvar= ,trans_dataset= );
title "Transformation of &yvar.-values with convenient lambda";
proc sql;
create table &trans_dataset as
select &xvar, &yvar,
1/(&yvar**2) as neg_2_&yvar,
1/(&yvar**1) as neg_1_&yvar,
1/(sqrt(&yvar)) as neg_half_&yvar,
log(&yvar) as zero_&yvar,
sqrt(&yvar) as half_&yvar,
&yvar**1 as one_&yvar,
&yvar**2 as two_&yvar
from &pre_trans_dataset;
quit;
proc print data=&trans_dataset;
run;
%mend transform_lambda;
Macro ‘TRANSFORM_LAMBDA’ can be invoked with the following code for the dataset ‘HEALTH_COX’:
%transform_lambda (pre_trans_dataset=health_cox, xvar=x, yvar=y,
trans_dataset=health_trans);
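When only one lambda is of interest, a single transformed column can be created instead of all seven. The sketch below is one way to do this, assuming the same simple power form used above (Y raised to lambda, with log(Y) for lambda = 0); the macro variable LAMBDA and the names HEALTH_ONE_LAMBDA and TRANS_Y are hypothetical.

```sas
/* Sketch: apply one chosen lambda instead of creating all seven columns. */
/* LAMBDA, HEALTH_ONE_LAMBDA and TRANS_Y are hypothetical names.          */
%let lambda = -1;

data health_one_lambda;
   set health_cox;
   if &lambda = 0 then trans_y = log(y);   /* lambda = 0 corresponds to log(Y) */
   else trans_y = y**(&lambda);            /* lambda = -1 gives 1/Y            */
run;

proc print data=health_one_lambda;
run;
```

With LAMBDA set to -1, TRANS_Y holds the same values as ‘neg_1_y’.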

Standardization of X-variable
After transformation of the Y-variable, the X-variable is standardized using PROC STDIZE in order to obtain a meaningful Y-intercept. With METHOD=MEAN, PROC STDIZE subtracts the mean of X from each value without rescaling, so the intercept becomes the predicted response at the mean of X. Dataset ‘HEALTH2’ is generated from table ‘HEALTH_TRANS’ in this procedure. The OPREFIX option is used to prefix the original X-variable name with ‘Unstdized_’, while the standardized X-values are stored under X.
title "Standardized X-variable after Y-transformation";
proc stdize data=health_trans
oprefix=Unstdized_
method=mean
out=health2;
var x;
run;
proc print data=health2;
run;
The generated dataset ‘HEALTH2’ contains the standardized X-variable in the last column as X (Table 10).
Macro ‘STDIZE_X’ is defined below for the above code, where ‘TRANS_DATASET=’ is the name of the input dataset; ‘TRANS_STDIZE_DATASET=’ is the name of the output dataset; and ‘XVAR=’ is the name of the X-variable to be standardized.
%macro stdize_x (trans_dataset= ,trans_stdize_dataset= ,xvar= );
title "Standardized &xvar.-variable after Y-transformation";
proc stdize data=&trans_dataset
oprefix=Unstdized_
method=mean
out=&trans_stdize_dataset;
var &xvar;
run;
proc print data=&trans_stdize_dataset;
run;
%mend stdize_x;
Macro ‘STDIZE_X’ can be called with the following code for the dataset ‘HEALTH_TRANS’:
%stdize_x (trans_dataset=health_trans, trans_stdize_dataset=health2, xvar=x);
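In keeping with the PROC SQL theme of this paper, the same mean-centering can be sketched in PROC SQL. This is an illustrative alternative rather than the code used in the analysis; the table name HEALTH2_SQL is hypothetical, and only the variables needed for the later regression are kept.

```sas
/* Sketch: mean-center X in PROC SQL (same effect as METHOD=MEAN in PROC STDIZE). */
/* HEALTH2_SQL is a hypothetical table name.                                      */
proc sql;
   create table health2_sql as
   select x           as Unstdized_x,   /* original X-values      */
          x - mean(x) as x,             /* X minus the mean of X  */
          y,
          neg_1_y
   from health_trans;
quit;
```

PROC SQL remerges the overall mean with every row, so no separate step is needed to compute the mean first.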

Regression Analysis of Transformed-Standardized Data
Regression analysis and normality tests are performed on the transformed and standardized dataset ‘HEALTH2’ by calling the previously defined macro ‘REG_NORMALITY’. Variable X is the standardized X, and ‘neg_1_y’ is the transformed Y.
%reg_normality (dataset=health2, xvar=x, yvar=neg_1_y);
With the transformed data, the LACK OF FIT for the linear model turns out to be non-significant, which indicates that the linear model is acceptable for X and Y (Table 11). Parameter estimates for the intercept and X are significant (Table 12A). Compared to the raw data, the adjusted R² value with the transformed data improves from 0.11 to 0.53 (Table 12B). Other results indicate that the transformed data are approximately normally distributed (Figures 6-8; Table 13). The non-significant p-value of the Kolmogorov-Smirnov normality test is consistent with normally distributed data (Table 14). However, the other tests of normality are still significant, which indicates that there is room for further improvement with respect to normal distribution. There is at least one outlier and leverage observation that is influencing the normal distribution (Figure 9).

Outlier and Influential Observations
The R and INFLUENCE options of the MODEL statement and the RSTUDENTBYLEVERAGE, DFFITS, DFBETAS and COOKSD plot requests are used in PROC REG to generate detailed diagnostics on outlier and/or influential observations for the dataset ‘HEALTH2’ (Table 15; Figures 10-13).
ods graphics on;
title "Outlier or Influential observations";
proc reg data= health2 plots (only label)= (rstudentbyleverage dffits dfbetas
cooksd);
model neg_1_y = x/r influence;
run;
ods graphics off;
The highest number of asterisks is seen for observation number 43 (Table 15). This observation turns out to be an outlier and leverage observation (Figure 11). Other results also support that observation number 43 is an outlier and influential observation in the dataset ‘HEALTH2’ (Figures 10-13).

Macro ‘OUTLIER_OBS’ is defined below for the above code, where ‘INDATA=’ is the name of the input dataset; and ‘XVAR=’ and ‘YVAR=’ are the names of the X- and Y-variables to be used in the analysis, respectively.
%macro outlier_obs (indata= ,xvar= ,yvar= );
ods graphics on;
title "Outlier or influential observations: Dataset &indata";
proc reg data= &indata plots (only label)= (rstudentbyleverage dffits
dfbetas cooksd);
model &yvar = &xvar/r influence;
run;
ods graphics off;
%mend outlier_obs;
Macro ‘OUTLIER_OBS’ can be invoked with the following statement for the dataset ‘HEALTH2’:
%outlier_obs (indata=health2, xvar=x, yvar=neg_1_y);
Slicing One Outlier Observation
Dataset ‘SLICED’, containing outlier and leverage observation number 43, is created from the dataset ‘HEALTH2’ using the following DATA step code. SAS-supplied observation numbers are used to identify the outlier(s) and to generate a dataset for them with this code. Alternatively, a WHERE statement can be used in PROC REG to omit observations while performing the regression analysis.
title "Dataset for outlier observation(s): sliced";

data sliced;
do slice=43;
set health2 point=slice;
output;
end;
stop;
run;
proc print data=sliced;
run;
Dataset ‘SLICED’ with one outlier observation is shown below (Table 16):
Macro ‘SLICE_OBS’ is defined below for the above code, where ‘INDATA=’ is the name of the input dataset; ‘OB=’ is the outlier observation number; and ‘SLICED_DATA=’ is the name of the output dataset for storing only the outlier observation.
%macro slice_obs (indata= ,ob=0 ,sliced_data= );
title "Dataset for outlier observation(s): &sliced_data";
data &sliced_data;
do slice=&ob;
set &indata point=slice;
output;
end;
stop;
run;
proc print data=&sliced_data;
run;
%mend slice_obs;
Macro ‘SLICE_OBS’ can be called with the following statement for outlier observation number 43 of the dataset ‘HEALTH2’:
%slice_obs (indata=health2, ob=43, sliced_data=sliced);
When no observation number is provided, use OB=0 to produce missing values in place of an outlier observation. Further analysis is not affected by these missing values.
%slice_obs (indata=health2, ob=0, sliced_data=sliced);
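The WHERE alternative mentioned at the start of this section can be sketched as follows. A WHERE statement cannot refer to the SAS-supplied observation number directly, so the sketch first stores it in a variable; OBS_ID and HEALTH2_ID are hypothetical names.

```sas
/* Sketch: keep an explicit observation number so WHERE can refer to it. */
/* OBS_ID and HEALTH2_ID are hypothetical names.                         */
data health2_id;
   set health2;
   obs_id = _n_;                 /* observation number in the current order */
run;

proc reg data=health2_id;
   where obs_id ne 43;           /* omit the outlier without a new dataset */
   model neg_1_y = x / lackfit;
run;
quit;
```

The stored number reflects the order of ‘HEALTH2’ at the time this DATA step runs, so it should be created before the dataset is sorted for any other purpose.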

Dataset without One Outlier Observation
The following DATA step program is used to generate dataset ‘HEALTH3_1’ with all the observations of the dataset ‘HEALTH2’ except the one that matches the outlier observation in the dataset ‘SLICED’ (Table 17A). Here, the data have to be sorted before merging. Note that the output dataset ‘HEALTH3_1’ has only 50 observations. The total real and CPU times required for PROC SORT and the DATA step are 0.07 and 0.06 seconds, respectively (Log 3A).
title "Sorting datasets before merging";
proc sort data=health2;
by unstdized_x y;
run;
title "Dataset without outlier observation(s)";
Data health3_1;
merge health2 (in= inhealth)
sliced (in=insliced);
by unstdized_x y;
if inhealth ^= insliced;
run;
proc print data=health3_1;
run;
Alternatively, the following PROC SQL code can be used to produce the same output in the form of table ‘HEALTH3’ (Table 17B). Unlike the DATA step program, PROC SQL can combine the datasets without prior sorting. The real and CPU times required for PROC SQL are 0.01 and 0.03 seconds, respectively (Log 3B). In other words, the PROC SQL code is shorter and quicker than the DATA step program in this example.
title "Dataset without outlier observation(s)";
proc sql;

create table health3 as
select* from health2
except all
select* from sliced;
quit;
proc print data= health3;
run;
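The ALL keyword matters here because the data contain repeated (X, Y) pairs (see Table 1): EXCEPT without ALL would also remove duplicate rows from the result. As a quick check, the row counts before and after can be written to the log. The following is a minimal sketch; the macro variable names are hypothetical.

```sas
/* Sketch: confirm that exactly the sliced rows were removed. */
/* N_BEFORE, N_SLICED and N_AFTER are hypothetical names.     */
proc sql noprint;
   select count(*) into :n_before trimmed from health2;
   select count(*) into :n_sliced trimmed from sliced;
   select count(*) into :n_after  trimmed from health3;
quit;

%put NOTE: before=&n_before sliced=&n_sliced after=&n_after;
```

If every row of ‘SLICED’ has a match in ‘HEALTH2’, the count after the query should equal the count before minus the number of sliced rows.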
The outputs of the DATA step program and PROC SQL are compared and verified with PROC COMPARE. All the values in the datasets ‘HEALTH3_1’ and ‘HEALTH3’ are equal in all respects (Log 3C).
title "Comparison of datasets: Data step program vs PROC SQL";
proc compare
base=health3
compare=health3_1
printall;
run;
Macro ‘NO_OUTLIER_DATA’ is defined below for the above PROC SQL code, where ‘INDATA=’ is the name of the dataset with all the observations; ‘SLICED_DATA=’ is the name of the dataset with only the outlier observations; and ‘OUTDATA=’ is the name of the output dataset with all the observations except the outliers.
%macro no_outlier_data (indata= ,sliced_data= ,outdata=);
title "&outdata.: Dataset without outlier observation(s)";
proc sql;

create table &outdata as
select* from &indata
except all
select* from &sliced_data;
quit;
proc print data= &outdata;
run;
%mend no_outlier_data;
Macro ‘NO_OUTLIER_DATA’ can be invoked with the following statement for the datasets ‘HEALTH2’ and ‘SLICED’:
%no_outlier_data (indata=health2, sliced_data=sliced, outdata=health3);
Regression Analysis without One Outlier Observation
Regression analysis and normality tests are performed on dataset ‘HEALTH3’ by invoking the previously defined macro ‘REG_NORMALITY’. Dataset ‘HEALTH3’ is devoid of one outlier observation. X is the standardized X-variable, and ‘neg_1_y’ is the transformed Y-variable.
Since the LACK OF FIT is non-significant, the linear model for X and Y can be accepted for prediction (Table 18). In the absence of one outlier observation, the adjusted R² value increases from 0.53 in the previous regression analysis to 0.61 (Table 19B). Parameter estimates also change (Table 19A). Results suggest that the data are approximately normal (Figures 14 and 15; Table 20). Further improvement may be possible, as the Shapiro-Wilk test, one of the four tests of normality, is still significant. Care should be taken to avoid eliminating too many observations while improving the normal distribution of the data.

Regression Analysis without Second Outlier Observation
The previously defined macros come in handy for this task, and no more hard coding is required. The following macros are invoked to complete it.
%slice_obs (indata=health3, ob=10,sliced_data=sliced2);
%no_outlier_data (indata=health3, sliced_data=sliced2, outdata=health4);
Outlier and influential observations are produced by calling macro ‘OUTLIER_OBS’. Observation number 10, with the highest number of asterisks in dataset ‘HEALTH3’, is identified as an outlier (Table 21; Figure 16).

Macro ‘SLICE_OBS’ stores one outlier observation (obs # 10) in a dataset named ‘SLICED2’ (Table 22).
Table ‘HEALTH4’ without the second outlier observation is generated by calling macro ‘NO_OUTLIER_DATA’ (Table 23). Note that the total number of observations in the dataset ‘HEALTH4’ is reduced to 49 (Table 23; Log 4).
Regression analysis and normality tests are run on dataset ‘HEALTH4’ by invoking macro ‘REG_NORMALITY’. As in the earlier results, further improvement is achieved in the parameter estimates and the adjusted R² (0.70). The data are closer to normal than before (Tables 24, 25A and 25B; Figures 17 and 18). However, the Shapiro-Wilk normality test is still significant (Table 26).

Regression Analysis without Third Outlier Observation
For the sake of exploration, further analysis is carried out to eliminate a third outlier observation from the data. Interestingly, invoking macro ‘OUTLIER_OBS’ on the ‘HEALTH4’ dataset produces two competing outlier candidates (obs # 47 and obs # 28) (Table 27; Figures 19-22). For this reason, further analysis with the other macros (SLICE_OBS, NO_OUTLIER_DATA, and REG_NORMALITY) is carried out separately for observation numbers 47 and 28.

Analysis without outlier observation number 47:
Elimination of observation number 47 did not improve the status of normality tests (Table 28). Shapiro-Wilk test is still
significant.
Analysis without outlier observation number 28:

Dataset ‘SLICED3’ for outlier observation number 28 is generated by invoking macro ‘SLICE_OBS’ (Log 5). Note that dataset ‘HEALTH5’ contains only 48 observations (Log 6). After eliminating observation number 28, all four tests of normality are non-significant (Table 29). Other results support this finding (Figures 23 and 24). The residual-fit spread plot indicates how much of the variation in the model is accounted for by the X-variable (Figure 25). The analysis of variance, parameter estimates and adjusted R² are presented in Tables 30, 31A and 31B, respectively. As in the previous analysis, the LACK OF FIT for the model is non-significant (Table 30). The adjusted R² value further improves to 0.73.

Linear Regression Models
From the above analysis, the linear regression models for the raw data, the normalized data, and the normalized data without 1 to 3 outlier observations are given below. For the normalized models, Y denotes the transformed Y-variable (‘neg_1_y’) and X denotes the standardized X-variable. After normalization, the data start to exhibit the true relationship between X and Y. One should consider several other factors before proceeding to eliminate outlier observations.
Raw data (non-normal): Y = 0.124 + 0.933X (Adjusted R²: 0.11)
Normalized data: Y = 1.240 – 0.514X (Adjusted R²: 0.53)
Normalized data without 1 outlier observation: Y = 1.213 – 0.653X (Adjusted R²: 0.61)
Normalized data without 2 outlier observations: Y = 1.234 – 0.689X (Adjusted R²: 0.70)
Normalized data without 3 outlier observations: Y = 1.218 – 0.690X (Adjusted R²: 0.73)
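Predictions from the normalized models are on the transformed scale and have to be converted back before they can be read as Y-values. The following is a minimal sketch, assuming that the response in the final model is ‘neg_1_y’ (that is, 1/Y), that X was centered on the mean of X in ‘HEALTH_TRANS’, and that the rounded coefficients listed above are adequate; the table and column names are hypothetical.

```sas
/* Sketch: predict on the original Y scale from the final fitted model.      */
/* Assumes response = 1/Y and X centered on the mean of X in HEALTH_TRANS.   */
/* HEALTH_PRED, PRED_NEG_1_Y and PRED_Y are hypothetical names.              */
proc sql;
   create table health_pred as
   select x,
          y,
          1.218 - 0.690*(x - mean(x))       as pred_neg_1_y,  /* predicted 1/Y */
          1 / (1.218 - 0.690*(x - mean(x))) as pred_y         /* back to Y     */
   from health_trans;
quit;

proc print data=health_pred;
run;
```

The back-transformed value is meaningful only where the predicted 1/Y is positive; for X-values far above the mean the linear predictor can reach zero or turn negative, and such predictions should not be used.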
Scatter Plots after Normalization
Optionally, the relationship between X and Y can be visualized by calling macro ‘SCATTER_CORR’ again for the transformed dataset ‘HEALTH2’ and the final dataset ‘HEALTH5’.
Analysis of dataset ‘HEALTH2’:
%scatter_corr (dataset=health2, xvar=x, yvar=neg_1_y);

Analysis of dataset ‘HEALTH5’:

The Pearson correlation coefficient of the non-normal raw data is 0.35 (Table 2). Unlike the raw data, both datasets ‘HEALTH2’ and ‘HEALTH5’ are normal and exhibit a similar, strong negative relationship between X and Y, with Pearson correlation coefficients between -0.70 and -0.90 (Figures 26 and 27; Tables 32 and 33).
Master Macros
Three master macros are created for the whole analysis. When invoked, these master macros call the previously defined macros.
Master macro 1: IMP_SCATT_CORR_REG_NORMAL
This macro covers the initial set of operations (data import from an Excel file, scatter plot, correlation, regression analysis and normality tests). It calls three macros (EXCEL_IMPORT, SCATTER_CORR and REG_NORMALITY). All the keyword parameters are described above for each macro. The code for macro EXCEL_IMPORT may be modified according to the version of the Excel file.
%macro imp_scatt_corr_reg_normal (excel_file= ,excel_sheet= , dataset= ,xvar= ,
yvar=);
%excel_import (excel_file=&excel_file, excel_sheet=&excel_sheet,
dataset=&dataset);
%scatter_corr (dataset=&dataset, xvar=&xvar, yvar=&yvar);
%reg_normality (dataset=&dataset, xvar=&xvar, yvar=&yvar);
%mend imp_scatt_corr_reg_normal;
Master macro 2: TRANSFORMATION_BOX_COX
This macro transforms the data if they are not normal (conditions apply). It calls four macros (TRANSFORM_ZERO_NEG, BOX_COX_LAMBDA, TRANSFORM_LAMBDA and STDIZE_X). All the keyword parameters are described above for each macro.
%macro transformation_box_cox (dataset= , pre_trans_dataset= , trans_dataset= ,
trans_stdize_dataset= , xvar= , yvar=);
%transform_zero_neg (dataset=&dataset, xvar=&xvar, yvar=&yvar,
pre_trans_dataset=&pre_trans_dataset);
%box_cox_lambda (pre_trans_dataset=&pre_trans_dataset, xvar=&xvar,
yvar=&yvar);
%transform_lambda (pre_trans_dataset=&pre_trans_dataset, xvar=&xvar,
yvar=&yvar, trans_dataset=&trans_dataset);

%stdize_x (trans_dataset=&trans_dataset,
trans_stdize_dataset=&trans_stdize_dataset, xvar=&xvar);
%mend transformation_box_cox;
Master macro 3: REGRESSION_WOUT_OUTLIERS
This macro identifies and eliminates outlier observations in the data, and uses the outlier-free data for regression analysis. It calls four macros (REG_NORMALITY, OUTLIER_OBS, SLICE_OBS and NO_OUTLIER_DATA). All the keyword parameters are described above for each macro.
%macro regression_wout_outliers (dataset= , indata= , ob= ,sliced_data= ,
outdata= ,xvar= , yvar=);
%reg_normality (dataset=&dataset, xvar=&xvar, yvar=&yvar);
%outlier_obs (indata=&indata, xvar=&xvar, yvar=&yvar);
%slice_obs (indata=&indata, ob=&ob, sliced_data=&sliced_data);
%no_outlier_data (indata=&indata, sliced_data=&sliced_data,
outdata=&outdata);
%mend regression_wout_outliers;
Now the whole analysis can be performed on the same (or a similar) type of data in a short period of time, without any further hard coding, by calling the master macros in the following manner. It is important to specify the location of the stored macros before calling them.
options mautosource sasautos="C:\Users\Perla\Documents\My SAS Files\9.4\statmacros";
%imp_scatt_corr_reg_normal (excel_file=data1,excel_sheet=XY_Data,
dataset=health,xvar=x, yvar=y);
%transformation_box_cox (dataset=health, pre_trans_dataset=health_cox,
trans_dataset=health_trans, trans_stdize_dataset=health2,xvar=x, yvar=y);
For the third master macro (REGRESSION_WOUT_OUTLIERS), start with a DATASET that has the transformed Y- and standardized X-variables. Run this macro first with OB=0, then with OB= set to the observation number to be deleted. In the first run, identify the outlier observation number. In the second run, slice this observation from the data. Repeat these two steps, with caution, until the desired results are achieved.
%regression_wout_outliers (dataset=health2, indata=health2, ob=0,
sliced_data=sliced, outdata=health3, xvar=x , yvar=neg_1_y);
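**From this run, it is clear that ob=43 is an outlier;
**For improvement, slice ob=43 from health2;
%regression_wout_outliers (dataset=health2, indata=health2, ob=43,
sliced_data=sliced, outdata=health3, xvar=x , yvar=neg_1_y);
%regression_wout_outliers (dataset=health3, indata=health3, ob=10,
sliced_data=sliced1, outdata=health4, xvar=x , yvar=neg_1_y);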

**Run again after satisfaction and use model parameters for final use;
Optionally, after above analysis, the relationship between X and Y can be visualized by calling macro
‘SCATTER_CORR’ again for transformed datasets (HEALTH2 and HEALTH5).
Conclusion
In this paper, a simple linear regression model is developed for the X- and Y-variables after normalizing replicated raw data in a systematic manner. Various statistical methods, SAS DATA step programs and PROC SQL code are employed to achieve this goal. PROC SQL is effectively used in place of several DATA step programs. By bringing the SAS macro language on board, the number of SAS statements required to perform each repeatable task is reduced to a bare minimum. Furthermore, the defined macros can be used to analyze similar data within a short period of time without much hard coding.
References
Box, G. E. P. and Cox, D. R. 1964. An analysis of transformations, Journal of the Royal Statistical Society (With discussion), Series B 26 (2): 211–252.
Buthmann, A. Making Data Normal Using Box-Cox Power Transformation, iSix Sigma. Available at http://www.isixsigma.com/tools-templates/normality/making-data-normal-using-box-cox-power-transformation/
Carpenter, Art. 2004. Carpenter’s Complete Guide to the SAS® Macro Language, Second Edition, SAS® Institute Inc., Cary, NC, USA.
Lafler, Kirk Paul. 2013. PROC SQL: Beyond the Basics Using SAS®, Second Edition, SAS® Institute Inc., Cary, NC, USA.
SAS® 9.4 Product Documentation, SAS Institute Inc., Cary, NC, USA. Available at http://support.sas.com/documentation/94/index.html
SAS/STAT® 9.3 User's Guide, SAS Institute Inc., Cary, NC, USA. Available at http://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#intro_toc.htm
SAS® 9.2 Macro Language: Reference, SAS Institute Inc., Cary, NC, USA. Available at http://support.sas.com/documentation/cdl/en/mcrolref/61885/HTML/default/viewer.htm#titlepage.htm
SAS® 9.3 SQL Procedure User’s Guide, SAS Institute Inc., Cary, NC, USA. Available at http://support.sas.com/documentation/cdl/en/sqlproc/63043/HTML/default/viewer.htm#titlepage.htm
Acknowledgments
I would like to thank the organizers for giving me an opportunity to present this paper at Ohio SAS® Users
Conference Hosted by CinSUG, CoSUG and CleveSUG on June 1, 2015 at Kingsgate Marriott Conference Center at
the University of Cincinnati, Cincinnati, Ohio.

Trademark Citations
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Author Biography
Venu Perla, Ph.D. is a biomedical researcher with about 14 years of research and teaching experience in an academic environment. He is currently working in West Virginia. He has served Purdue University, Oregon Health & Science University, Colorado State University, Kerala Agricultural University (India) and Mangalayatan University (India) in different capacities. Dr. Perla has published 13 peer-reviewed research papers and 2 book chapters, obtained 1 international patent (on an orthopaedic implant device), and given 7 talks and presented 18 posters at national and international scientific conferences in his professional career. He was trained in clinical trials and clinical data management. He was also trained in advanced SAS® programming and clinical biostatistics at the University of California, San Diego. Currently, he is actively employing SAS® programming techniques in his research data analysis.
Contact Information
Phone (Cell): (304) 545-5705
Email: venuperla@yahoo.com
LinkedIn: https://www.linkedin.com/pub/venu-perla/2a/700/468

Appendix
Table 1. XY_Data sheet of data1.xls (Microsoft Excel 97-2003 file).
X Y
0.4 0.4
0.6 0.5
2.2 15.3
0.4 0.7
0.1 0.5
0.7 0.6
2.5 1.1
0.4 0.5
0.5 0.6
1.3 0.9
0.4 0.4
1.8 1.6
0.5 1.8
0.5 0.5
0.7 0.7
0.3 0.7
1.4 0.9
0.8 0.6
1.3 1
0.6 0.6
1.2 1
2 2.1
0.7 0.6
1.3 1.1
1.1 1
2 1.3
0.6 0.7
2.1 1.7
1.8 1.4
1.2 0.8
1 0.7
2.1 1.5
1.4 1
0.7 0.8
0.5 0.5
0.9 0.7
1.2 0.5
1.1 0.7
2.5 2
1 0.7
0.9 0.8
3 2.7
4.2 1.5
0.9 1
1.9 1.6
1 0.8
1.2 0.7
0.8 0.7
1.4 0.8
1.4 1.4
1 1
