In this paper, the author will not only provide a brief introduction of the COMPARE procedure, but will also share with you her experience to speed up the validation process. Comparison discussed in this paper is mainly between two datasets, even though it is possible to compare variables within the same dataset using the COMPARE procedure. Below are the topics that will be covered in this paper.
• A brief introduction to the COMPARE procedure
• Orthodox Use of Features in PROC COMPARE to speed up the validation of a dataset
• Unorthodox Use of Features in PROC COMPARE to speed up the validation of a dataset
• A macro named %QCDATA to speed up the validation of a single dataset
• Introduction to &SYSINFO generated by the COMPARE procedure
• A macro named %QCDIR to identify all datasets in the production and validation directories
5. 7 #WUSS15
Why Validate?
The Latin motto Finis origine pendet
means “The end depends on the
beginning”.
So if your data is clean,
your analysis result
cannot be accurate.
6. Overview
PROC COMPARE allows users to:
Compare 2 datasets;
Compare variables against variables of the
same dataset;
Compare variables against variables of
different names between 2 datasets.
8 #WUSS15
8. Scope (continued)
Compare 2 datasets and their attributes (Type, Creation
and Modified Dates, Number of Observations, Number
of Variables, Number of Observations and labels).
Check if the Same Variables in both datasets.
Check for Variable Attributes (Type, Length, Format,
Informat, Label).
Check if Observations Matched in ID variables.
Check if there are Duplicated Observations based on ID
variables.
Comparison of Variable Values.
10 #WUSS15
9. Syntax
PROC COMPARE <option(s)>;
BY <DESCENDING> variable-1
<DESCENDING> variable-2 …
<NOTSORTED>;
ID <DESCENDING> variable-1 …
<DESCENDING> variable-2 …
<NOTSORTED>;
VAR variable(s);
WITH variable(s);
run;
11 #WUSS15
10. Syntax Used Here:
PROC COMPARE <option(s)>;
BY <DESCENDING> variable-1
<DESCENDING> variable-2 …
<NOTSORTED>;
ID <DESCENDING> variable-1 …
<DESCENDING> variable-2 …
<NOTSORTED>;
VAR variable(s);
WITH variable(s);
RUN;
12 #WUSS15
11. Some Useful Options in PROC COMPARE
Statement
BASE = option specifies the dataset use as the base dataset
for comparisons. Alias: DATA = option.
COMPARE = option specifies the dataset to be used as the
comparison dataset.
LISTVAR lists all variables found in only one dataset.
LISTOBS lists all observations that are found only in one
dataset.
LISTALL is equivalent to applying both LISTVAR and LISTOBS.
LISTEQUALVAR lists variables that are matched in values.
13 #WUSS15
12. Some Useful Options in PROC COMPARE
Statement –continued
ALLOBS includes in the report of value comparison results
the values and for numeric variables, the differences for all
matching observations, even if they are judged equal.
ALLSTATS prints a table of summary statistics for all pairs of
matching variables.
METHOD=ABSOLUTE|EXACT|PERCENT|RELATIVE specifies
the method for judging the equality of numeric values. By
default, METHOD=EXACT.
CRITERION=gamma specifies the criterion for judging the
equality of numeric values. Default value is 0.0001.
14 #WUSS15
13. Some Useful Options in PROC COMPARE
Statement –continued
TRANSPOSE prints the reports of value difference by
observation instead of by variable.
BRIEFSUMMARY prints only a short comparison summary.
NOVALUES suppresses the report of the values comparison
results.
NOPRINT suppress all printed output.
MAXPRINT=total|(per-variable, total) specifies the
maximum number of differences to be printed.
15 #WUSS15
14. Some Useful Options in PROC COMPARE
Statement –continued
OUTBASE, OUTCOMP, OUTDIF, OUTPERCENT and OUT=
options generate output dataset with the observations
from the BASE dataset, the COMPARE dataset and an
observation to indicate the difference and the difference in
percent between 2 datasets for any pair of observations, if
any.
OUTALL is equivalent to using these 4 options: OUTBASE,
OUTCOMP, OUTDIF and OUTPERCENT.
OUTNOEQUAL suppresses the writing of an observation to
the output dataset when all values in the observation are
judged equal.
16 #WUSS15
15. PROC COMPARE Statements
BY Statement allows BY group comparison. Please make
sure BASE and COMPARE datasets are sorted according to
the variables stated in the BY statement.
ID Statement helps one to easily identify the
observations(s) with discrepancies. Sorting the BASE and
COMPARE datasets according to ID variables is highly
recommended, despite a NOTSORTED option is available.
VAR Statement enables one to specify the variables to be
compared.
WITH Statement restricts the comparison of the values of
variables to the ones named in this statement.
17 #WUSS15
16. 18 #WUSS15
A Simple Example Using PROC COMPARE –
BASE Dataset PROD.ADSL
Note: The RED text above represents problematic data.
17. 19 #WUSS15
A Simple Example Using PROC COMPARE –
COMPARE Dataset QC.ADSL
Note: The RED text above represents problematic data.
18. Discrepancies: PROD.ADSL vs QC.ADSL
PROC ADSL has 8 variables and 8 observations while
QC.ADSL has 7 variables and 7 observations.
Variable AGEU exists only in PROD.ADSL.
There are some discrepancies in values of variables
HEIGHTBL, BMIBL and COUNTRY.
Subject 302 has an observation only in PROC.ADSL.
Subject 303 has 2 observations in PROD.ADSL, even though
ADSL is supposed to have only one record per subject.
Subject 303 has only one single observation in QC.ADSL.
Subject 304 exists only in QC.ADSL.
20 #WUSS15
19. 21 #WUSS15
EXAMPLE 1: COMPARE PROD.ADSL vs QC.ADSL
PROC COMPARE
BASE=PROD.ADSL COMPARE=QC.ADSL
LISTVAR LISTOBS LISTEQUALVAR
MAXPRINT=(20, 2400);
ID SUBJID;
RUN;
20. Again the meaning of these options are:
LISTVAR lists all variables found in only one dataset.
LISTOBS lists all observations that are found only in one
dataset.
LISTEQUALVAR lists variables that are matched in values.
MAXPRINT=total|(per-variable, total) specifies the
maximum number of differences to be printed.
22 #WUSS15
25. 27 #WUSS15
EXAMPLE 1: PROC COMPARE OUTPUT (Continued)
(5). Value Comparison Results for Variables
26. 28 #WUSS15
EXAMPLE 1: PROC COMPARE OUTPUT (Continued)
(5). Value Comparison Results for Variables
Note: BMIBL is partly based on HEIGHTBL. So discrepancies in BMIBL for
Subject 202 is most likely the result of discrepancy in HEIGHTBL value.
27. 29 #WUSS15
EXAMPLE 1: PROC COMPARE OUTPUT (Continued)
(5). Value Comparison Results for Variables
Truncated. Limited to 20 characters.
Note: The value for COUNTRY has been truncated because PROC COMPARE
limited the display to 20 characters. This is true for the ID variables, as well as,
the variables to be compared in this report. The plus symbols (+) are likely to
mean the variables has a longer length than display.
28. 30 #WUSS15
EXAMPLE 2: COMPARE PROD.ADSL vs.
PROD.ADSL PERFECT MATCH
PROC COMPARE
BASE=PROD.ADSL COMPARE=PROD.ADSL
LISTVAR LISTOBS LISTEQUALVAR;
ID SUBJID;
run;
Note: Since the BASE dataset and the COMPARE dataset are
the same, it is guaranteed to be a PERFECT MATCH!
29. 31 #WUSS15
EXAMPLE 2: COMPARE PROD.ADSL vs.
PROD.ADSL PERFECT MATCH (Continued)
Eureka!!! This is the favorite text that makes every
validator rejoice!
30. 32 #WUSS15
EXAMPLE 2: COMPARE PROD.ADSL vs.
PROD.ADSL PERFECT MATCH (Continued)
NOTE: No unequal values were found.
All values compared are exactly
equal.
Now for Example 2, we know these statements are true
since we are comparing PROD.ADSL with itself!!!
However, these statements are NOT the Holy Grail to
judge if this is truly a PERFECT MATCH!!!
31. A Highly Recommended Paper
Don’t Get Blindsided by PROC COMPARE
Joshua Horstman and Roger Muller
This paper contains 4 excellent examples why the
following statements
NOTE: No unequal values were found.
All values compared are exactly equal.
are NOT the ‘Holy Grail’, as described by Horstman and
Muller!!!
33 #WUSS15
32. 34 #WUSS15
What is a Holy Grail?
The Holy Grail is a dish, plate, stone or cup that is part of
an important theme of Arthurian literature.
According to legend, it has special powers and is designed
to provide Happiness, Eternal Youth and Food in Infinite
Abundance. (Wikipedia)
33. 35 #WUSS15
What is the Holy Grail in validation?
We will get to this in detail later …
34. Another Highly Recommended Paper
A highly recommended paper on PROC COMPARE
introduction:
PROC COMPARE – Worth Another Look!
Christianna S. Williams
This paper contains 10 excellent examples starting from
the very basic.
36 #WUSS15
37. ORTHODOX USE TO SPEED UP VALIDATION
Use Divide and Conquer Techniques
Use WHERE statement
39 #WUSS15
PROC COMPARE BASE=PROD.ADSL
COMPARE=QC.ADSL LISTVAR LISTOBS
LISTEQUALVAR;
ID SUBJID;
WHERE SUBJID=‘303’;
RUN;
38. ORTHODOX USE TO SPEED UP VALIDATION
(Continued)
Use Divide and Conquer Techniques
Use VAR statement
40 #WUSS15
PROC COMPARE BASE=PROD.ADSL
COMPARE=QC.ADSL LISTVAR LISTOBS
LISTEQUALVAR;
ID SUBJID;
VAR HEIGHTBL WEIGHTBL BMIBL;
RUN;
39. ORTHODOX USE TO SPEED UP VALIDATION
(Continued)
Use Divide and Conquer Techniques
Apply DROP = to both BASE and COMPARE datasets
41 #WUSS15
PROC COMPARE
BASE=PROD.ADSL (drop=HEIGHTBL WEIGHTBL BMIBL)
COMPARE=QC.ADSL (drop=HEIGHTBL WEIGHTBL BMIBL)
LISTVAR LISTOBS LISTEQUALVAR;
ID SUBJID;
RUN;
40. ORTHODOX USE TO SPEED UP VALIDATION
(Continued)
Be Parsimonious in ID Variables
42 #WUSS15
Due to limited space in a page, SAS can only print a limited
number of ID variables. You want to keep those ID variables
that are critical.
Example: ID Variables for ADLB in one Single Study
Study ID, Subject ID, Lab Category, Lab Parameter, Analysis Visit, Analysis Date
41. ORTHODOX USE TO SPEED UP VALIDATION
(Continued)
Be Parsimonious in ID Variable Values
43 #WUSS15
Remember PROC COMPARE output prints only 20 characters
for each variable. So keep the part of the value that are
meaningful!
42. ORTHODOX USE TO SPEED UP VALIDATION
(Continued)
Be Parsimonious in ID Variable Values
44 #WUSS15
Remember PROC COMPARE output prints at most 20
characters for each variable. So keep the part of the value
that are meaningful!
Example: Use SUBJID instead of USUBJID
USUBJID = ‘ABCD-1234-56-0000000001’
SUBJID = ‘0000000001’
43. ORTHODOX USE TO SPEED UP VALIDATION
(Continued)
Reduce the Amount of Printed Output
45 #WUSS15
Use MAXPRINT = total | (per-variable, total)
total = Maximum Total Number of Differences to be printed.
per-variable = Maximum Total Number of Differences to
be printed for each variable within BY group.
44. ORTHODOX USE TO SPEED UP VALIDATION
(Continued)
Output Results in a Dataset, Instead of Printing
46 #WUSS15
• Use OUTBASE, OUTCMP, OUTDIF, OUTPERCENT,
OUTNOEQUAL and OUT= options.
• Suppress printing with NOPRINT option.
45. ORTHODOX USE TO SPEED UP VALIDATION
(Continued)
Output Results in a Dataset, Instead of Printing
47 #WUSS15
• Use OUTBASE, OUTCMP, OUTDIF, OUTPERCENT,
OUTNOEQUAL and OUT= options.
• Suppress printing with NOPRINT option.
PROC COMPARE BASE=PROD.ADSL COMPARE=QC.ADSL
OUT=OUT_EX1 OUTBASE OUTCOMP OUTDIF
OUTPERCENT OUTNOEQUAL NOPRINT;
ID SUBJID;
RUN;
46. ORTHODOX USE TO SPEED UP VALIDATION
(Continued)
Output Results in a Dataset, Instead of Printing
48 #WUSS15
OUT_EX1 generated by OUT=OUT_EX1, OUTBASE, OUTCOMP, OUTDIF and
OUTPERCENT Options
47. ORTHODOX USE TO SPEED UP VALIDATION
(Continued)
Output Results in a Dataset, Instead of Printing
49 #WUSS15
Art Carpenter’s Method to Output Data with Discrepancies
• Transpose the output dataset from PROC COMPARE.
• This method creates a report that displays all
discrepancies for different variables one subject at a
time.
Reference: Carpenter’s Guide to Innovative SAS® Techniques
50. 52 #WUSS15
UNORTHODOX USE TO SPEED UP VALIDATION
Recall when comparing PROD.ADSL against QC.ADSL,
there are 2 discrepancies in BMIBL.
BMI = (Weight in Kg) / (Height in meter) 2
51. UNORTHODOX USE TO SPEED UP VALIDATION
(Continued)
• Put Source Variables from BASE and COMPARE
Datasets in the ID Statement
53 #WUSS15
52. UNORTHODOX USE TO SPEED UP VALIDATION
(Continued)
• Put Source Variables from BASE and COMPARE
Datasets in the ID Statement
54 #WUSS15
PROC SORT DATA=PROD.ADSL OUT=PROD_ADSL;
BY SUBJID WEIGHTBL HEIGHTBL; RUN;
PROC SORT DATA=PROD.ADSL OUT=PROD_ADSL;
BY SUBJID WEIGHTBL HEIGHTBL; RUN;
PROC COMPARE BASE=PROD_ADSL COMPARE=QC_ADSL
LISTVAR LISTOBS LISTEQUALVAR;
ID SUBJID WEIGHTBL HEIGHTBL ;
RUN;
54. 56 #WUSS15
UNORTHODOX USE TO SPEED UP VALIDATION
(Continued)
OUTPUT:
Note: For Subject 202, the value for HEIGHTBL in PROD_ADSL
is different from that in QC_ADSL. Hence, the discrepancy in
BMIBL value is likely due to the discrepancy in the source
variable HEIGHTBL.
55. 57 #WUSS15
UNORTHODOX USE TO SPEED UP VALIDATION
(Continued)
OUTPUT:
Note: The value for source variables HEIGHTBL and
WEIGHTBL are the same, so the discrepancies in BMIBL is
likely due to miscalculation or a misunderstanding of its
definition.
56. UNORTHODOX USE TO SPEED UP VALIDATION
(Continued)
• Put Both Source Variables and the Variable to be
Compared in the ID Statement
58 #WUSS15
PROC SORT DATA=PROD.ADSL OUT=PROD_ADSL;
BY SUBJID WEIGHTBL HEIGHTBL BMIBL; RUN;
PROC SORT DATA=PROD.ADSL OUT=PROD_ADSL;
BY SUBJID WEIGHTBL HEIGHTBL BMIBL; RUN;
PROC COMPARE BASE=PROD_ADSL COMPARE=QC_ADSL
LISTVAR LISTOBS LISTEQUALVAR;
ID SUBJID WEIGHTBL HEIGHTBL BMIBL;
RUN;
57. 59 #WUSS15
UNORTHODOX USE TO SPEED UP VALIDATION
(Continued)
OUTPUT:
Now the 2 pairs of observations line up nicely for you to see!
58. 60 #WUSS15
UNORTHODOX USE TO SPEED UP VALIDATION
(Continued)
OUTPUT:
Now the 2 pairs of observations line up nicely for you to see!
Same HEIGHTBL and WEIGHTBL values, but
different BMIBL value. Most likely, miscalculation!
59. 61 #WUSS15
UNORTHODOX USE TO SPEED UP VALIDATION
(Continued)
OUTPUT:
Now the 2 pairs of observations line up nicely for you to see!
Different HEIGHTBL values and different BMIBL
values. Most likely, discrepancies is due to
values in source variable HEIGHTBL!
60. UNORTHODOX USE TO SPEED UP VALIDATION
(Continued)
Caveat
Of course, SOURCE variables have to be still in both
BASE and COMPARE datasets.
For this method, please put variables in ID statement
in the following order:
ID Original ID Variables
Source Variables
Variable to be Compared;
62 #WUSS15
61. UNORTHODOX USE TO SPEED UP VALIDATION
(continued)
Caveat
There is at least one variables other than the original
ID variables, the source variables and the variable to
be compared in both the BASE and SOURCE dataset.
Otherwise, you will get a message like:
NOTE: Except for the 4 ID variables, the data
sets WORK.PROD_ADSL and WORK.QC_ADSL have no
variables to compare. Comparisons of data values
not performed.
63 #WUSS15
65. %QCDATA
%QCDATA achieves the following:
Sort BASE and COMPARE dataset based on ID
variables.
Identify and print out duplicate observations
of BASE and COMPARE datasets upon
request, if applicable.
Allow users to specify ID variables to be used
in PROC COMPARE.
67 #WUSS15
66. %QCDATA (Continued)
%QCDATA achieves the following:
Allow users to specify ID variables to be used in
PROC COMPARE.
Allow users to drop variables no to be compared.
Perform PROC COMPARE.
Generate PROC COMPARE report, output dataset
and a dataset with the flag variables for duplication,
return code from PROC COMPARE and the status of
16 conditions based on &SYSINFO.
68 #WUSS15
67. 69 #WUSS15
%QCDATA (Continued)
%macro QCDATA (
BASE= /*Specify BASE dataset. */
, COMPARE= /*Specify COMPARE dataset. */
,VARS= /*Specify variables to be compared. */
,DROPBASEVARS= /*Specify BASE variable to be dropped. */
,DROPCOMPVARS= /*Specify COMPARE variable to be dropped. */
,PRINTDUPS=Y /*Print Duplicate Observations. */
,COMPRPT=Y /*Generate PROC COMPARE report. */
,CREATEOUTDS=Y/*Create Output Dataset from PROC COMPARE.*/
,OUTDATA= /*Specify the name of Output Dataset. */
,OUTRC= /*Specify the Flag Variable/RC Data Name. */
,REFRESH = N /*Refresh dataset specified in OUTRC- */
);
77. 79 #WUSS15
Introduction to &SYSINFO from
PROC COMPARE
&SYSINFO contains a return code
summarizing the result of
comparison of the BASE dataset
against the COMPARE dataset.
78. 80 #WUSS15
Introduction to &SYSINFO from
PROC COMPARE (Continued)
PROC COMPARE …;
… more codes …;
run;
%let RC = &SYSINFO;
Note: It is critical to
get the value of
&SYSINFO
immediately after
PROC COMPARE or
its value will change.
80. 82 #WUSS15
Introduction to &SYSINFO from
PROC COMPARE (Continued)
&SYSINFO is the sum of the code
for each condition that fails and
the conditions are …
85. 87 #WUSS15
Introduction to &SYSINFO from
PROC COMPARE (Continued)
Now because the code for each
Failure Description is of 2**X Format
(X=1, …, 16) and no code will be
repeated in a test,
Once we know the sum of the codes
(i.e., the value of &SYSINFO),
we will know what condition(s), if any,
has/have been violated.
86. 88 #WUSS15
Introduction to &SYSINFO from
PROC COMPARE (Continued)
Example
Say, &SYSINFO resolves to 48.
From the table, we know that is the result of
16 + 32 = 48
87. 89 #WUSS15
Introduction to &SYSINFO from
PROC COMPARE (Continued)
So, &SYSINFO = 48 = 16 + 32
So Condition 5 (Variable
has different length) and
Condition 6 (Variable has
different label) have been
Violated.
88. 90 #WUSS15
Introduction to &SYSINFO from
PROC COMPARE (Continued)
So again, this indicates Condition 5 (Variable has different length)
and Condition 6 (Variable has different label) have been
violated!!!
89. 91 #WUSS15
Introduction to &SYSINFO from
PROC COMPARE (Continued)
48 is an easy number to see its sum from
the table.
What if we have say 5357? What are the
code that adds up to this?
Ok. We can put the 5357 in BINARY16.
format and found the position with 1s.
Better yet, perform Bit Map Testing.
90. 92 #WUSS15
Introduction to &SYSINFO from
PROC COMPARE (Continued)
Code for Bit Map Testing (found within %QCDATA)
91. 93 #WUSS15
Introduction to &SYSINFO from
PROC COMPARE (Continued)
Code for Bit Map Testing (found within %QCDATA)
/* Test for Data Set Label. */
if RC='1'b then do;
RC_1='Y';
put '<<< Data set labels differ.';
end;
/* Test for Data Set Types. */
if RC='1.'b then do;
RC_2='Y'; RC_2 assigned.
put '<<< Data set types differs';
end;
RC_1 assigned.
92. 94 #WUSS15
Introduction to &SYSINFO from
PROC COMPARE (Continued)
Code for Bit Map Testing (found within %QCDATA)
/* Test for Data Set Label. */
if RC='1'b then do;
RC_1='Y';
put '<<< Data set labels differ.';
end;
/* Test for Data Set Types. */
if RC='1.'b then do;
RC_2='Y';
put '<<< Data set types differs';
end;
93. 95 #WUSS15
Introduction to &SYSINFO from
PROC COMPARE (Continued)
• A &SYSINFO < 64 means the differences
are only due to ATTRIBUTES.
94. 96 #WUSS15
Introduction to &SYSINFO from
PROC COMPARE (Continued)
• Remember the Holy Grail?
It is &SYSINFO=0!!!
It means NO ERROR Found!!!
In my RC dataset output,
&SYSINFO=0 is equivalent
to RC_0=‘Y’.
97. 99 #WUSS15
%QCDIR to validate datasets in directory
PROD and QC are the directories and ADXXs are the datasets.
98. 100 #WUSS15
%QCDIR to validate datasets in directory
(Continued)
Note: DICT.IDVARS contains the IDVARS for the dataset.
99. 101 #WUSS15
%QCDIR to validate datasets in directory
(Continued)
By means of PROC CONTENTS of PROD._ALL_ and QC._ALL_,
One can identify all datasets within the directory. Then
Merge with DICT.IDVARS to create WORK.ALLDS above.
100. 102 #WUSS15
%QCDIR to validate datasets in directory
(Continued)
By means of PROC CONTENTS of PROD._ALL_ and QC._ALL_,
One can identify all datasets within the directory. Then
Merge wit DICT.IDVARS to create WORK.ALLDS. This is performed
within %QCDIR.
Parameters in %QCDIR …
101. 103 #WUSS15
%QCDIR to validate datasets in directory
(Continued)
Code from %QCDIR
Use CALL EXECUTE (%NRSTR (…)) to execute %QCDATA.
102. 104 #WUSS15
%QCDIR to validate datasets in directory
(Continued)
Code from %QCDIR
Note: using CALL EXECUTE (%NRSTR (…)) to execute
%QCDATA is important. CALL EXECUTE(…) by itself will
generate incorrect result. This is because CALL EXECUTE
executes immediately, but here we need to resolve the
value of &SYSINFO for each comparison. Adding
%NRSTR( ) delays the execution.
CALL EXECUTE (‘%NRSTR(%QCDATA(BASE=‘||BASE||
‘,COMPARE=‘||COMPARE||
‘,IDVARS=‘||IDVARS||
‘,OUTRC=ALLRC,
REFRESH=Y));’);
104. 106 #WUSS15
%QCDIR to validate datasets in directory
(Continued)
Figure 5: Comparison of Datasets in Production vs. Validation Process
(Continued)
105. 107 #WUSS15
Conclusion
In this paper, the author goes through tips and
techniques to speed up the validation process. For
‘Without Automation’ techniques, Orthodox and
Unorthodox methods are introduced.
For ‘Automation’ Part, we have introduced %QCDATA,
&SYSINFO and %QCDIR.
‘How to Speed Up Your Validation Process Without
Really Trying?’ Maybe you do need to try a bit. But
hopefully, these techniques have speeded up your
Validation Process. Happy Compare-ing!!!
106. Acknowledgment
Art Carpenter, Caloxy Inc.
Jane Eslinger, SAS Technical Support
Boyd Roloff, Chiltern International
Kevin Russell, SAS Technical Support
108 #WUSS15
107. Contact Information
Name: Alice Cheng
Enterprise: Chiltern International
E-mail: alice.cheng@chiltern.com
alice_m_cheng@yahoo.com
109 #WUSS14