Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Proc Compare to Validate Datasets
                 Angelina Cecilia Casas M.S., PPD Development, Research Triangle Park, N...
______________________________________________________               # of Obs with Some Compared Variables Unequal: 0.
   ...
has discrepancies.                                                    var age;
                                           ...
id unique ;                                                                   Obs   Unique   _OBS_    _NAME_      _LABEL_ ...
Upcoming SlideShare
Loading in …5
×

Proc Compare to validate datasets

2,160 views

Published on

  • Be the first to comment

Proc Compare to validate datasets

  1. 1. Proc Compare to Validate Datasets Angelina Cecilia Casas M.S., PPD Development, Research Triangle Park, NC Variables Summary Number of Variables in Common: 3. ABSTRACT Comparison of two datasets is a technique used to know that two Observation Summary datasets are equal or if they have discrepancies. This method can be used to validate that a dataset has been created correctly Observation Base Compare or that the changes in a dataset are only those that are expected. I.e. observations were added or removed or corrections were First Obs 1 1 done. Last Obs 4 4 Number of Observations in Common: 4. INTRODUCTION Total Number of Observations Read from Proc Compare is a procedure that allows two datasets to be WORK.DEMOG: 4. compared for properties, number of observations, number of Total Number of Observations Read from variables, and properties of the datasets. WORK.COMPARE: 4. For a variable, you can get output about differences in: Number of Observations with Some Compared Variables Type, length, formats, informats and label(s). Unequal: 0. For a dataset, you can find differences in: Number of Observations with All Compared Variables date of creation, last modification of the datasets was modified, Equal: 4. number of variables and observations of the datasets. You can NOTE: No unequal values were found. All values also see the labels of the datasets, but differences are not compared are exactly equal. reported for that. For observations, you can compare the value of the record for The records in DEMOG and the records in compare have the each variable. Also, you can decide how different the values of same order and an ID statement is not needed. However if the the observations can be. order of the two datasets is not the same, records might be compared incorrectly. THE DATASETS THAT WILL BE COMPARED. proc print data=demog noobs; WHEN THE DATASETS HAVE A DIFFERENT Unique WEIGHTKG dob ORDER; 20120 101.6 08/19/1949 proc sort data=compare; 20130 94.3 06/02/1934 by DOB; 20110 85.7 09/17/1949 run; 20202 99.3 10/17/1931 proc compare base=demog (keep=unique weightkg) compare=compare (keep=unique weightkg); proc contents data=demog; run; Data Set Name: WORK.DEMOG Observations: 4 As in other SAS procedures and data step, it is possible to select Member Type: DATA Variables: 3 observations and variables in the datasets that are going to be --Alphabetic List of Variables and Attributes-- used. Dataset Created Modified NVar Nobs Variable Type Len Pos Format Label WORK.DEMOG 20JAN03:13:17 20JAN03:13:17 2 4 ----------------------------------------------- WORK.COMPARE 20JAN03:13:17 20JAN03:13:17 2 4 UNIQUE Char 200 32 Unique Record WEIGHTKG Num 8 8 5.1 Weight in kg Vars Summary dob Num 8 24 MMDDYY10. Date of Birth # of Vars in Common: 2. proc sql; Observation Summary create table Compare as select * from demog; quit; Observation Base Compare # of Obs in Common: 4. Total # of Obs Read from WORK.DEMOG: 4. Total # of Obs Read from WORK.COMPARE: 4. AN EXAMPLE OF A CLEAN COMPARE OUTPUT # of Obs with Some Compared Vars Unequal: 3. proc compare base=here.demog compare=compare; # of Obs with All Compared Vars Equal: 1. run; Values Comparison Summary The COMPARE Procedure # of Vars Compared with All Obs Equal: 0. Comparison of WORK.DEMOG with WORK.COMPARE # of Vars Compared with Some Obs Unequal: 2. (Method=EXACT) Total # of Values which Compare Unequal: 6. Maximum Difference: 15.9. Data Set Summary All Vars Compared have Unequal Values Dataset Created Modified Nvar Nobs Variable Type Len Label Ndif MaxDif Unique CHAR 200 Unique Record 3 WORK.DEMOG 14JAN03:16:03 14JAN03:16:06 3 4 WEIGHTKG NUM 8 Weight in kg 3 15.900 WORK.COMPARE 14JAN03:16:06 14JAN03:16:06 3 4 Value Comparison Results for Vars
  2. 2. ______________________________________________________ # of Obs with Some Compared Variables Unequal: 0. || Unique Record # of Obs with All Compared Variables Equal: 3. || Base Value Compare Value Obs || Unique Unique NOTE: No unequal values were found. All values compared _____ || ___________________+ ___________________+ are exactly equal. || 1 || 20120 20202 3 || 20110 20120 4 || 20202 20110 INFORMATION ABOUT VARIABLES OR _______________________________________________________ _______________________________________________________ OBSERVATIONS THAT ARE IN DIFFERENT IN || Weight in kg THE DATASETS BEING COMPARED. || Base Compare Obs || WEIGHTKG WEIGHTKG Diff. % Diff _____ || _________ _________ _________ _________ The output there are no differences, but it might be easy to miss || the fact that only 3 of the 4 observations were compared. Also, 1 || 101.6 99.3 -2.3000 -2.2638 3 || 85.7 101.6 15.9000 18.5531 you don’t know which observations were not compared unless 4 || 99.3 85.7 -13.6000 -13.6959 you use the LIST or LISTALL option. _______________________________________________________ proc compare data=base compare=compare The values of the variable that was to the right of the BASE or list; DATA statement is in the BASE column. The values of dataset id unique; listed to right of the COMPARE dataset is in the COMPARE run; column. It is necessary that the two datasets that are going to be Observation 2 in WORK.DEMOG not found in WORK.COMPARE: compared have the same order UNLESS the order of the Unique=20120. datasets is something that is being tested. Observation 4 in WORK.COMPARE not found in WORK.DEMOG: What happens when the order of the observations in each of the Unique=34343. datasets is different? The LIST option will list the observations and variables that exist proc sql; in one dataset and are missing in the other. There are other update compare set unique='34343' where options, LISTALL, LISTBASE, LISTBASEOBS, LISTBASEVAR, unique='20202'; LISTCOMP, LISTCMPOBS, LISTCOMPVAR, LISTOBS, AND quit; LISTVAR, to limit this report to one or the other dataset and only proc sort data=compare; variables and / or observations. LISTEQUALVAR will also list by unique; the variables that are not used in the ID variable list for which all run; values are equal. proc compare base=demog(keep = unique weightkg) compare=compare(keep = unique weightkg); run; LIMITING THE PRINTED OUTPUT _______________________________________________________ Sometimes it is not necessary to know all the observations that || Unique Record || Base Value Compare Value have an error, once it is known that an error exists, it might not be Obs || Unique Unique needed to know all the discrepancies. _______ || ___________________+ ___________________+ || It might be desirable to limit the size of the output; for that, the 2 || 20120 20130 3 || 20130 20202 option MAXPRINT is very useful. 4 || 20202 34343 _______________________________________________________ First it is necessary to use the original dataset and modify the variable WEIGHTKG to have something to report. Etcetera For example, let’s change the value of one of the variables in the ID statement from ‘34343’ to ‘20120’. Even though, there is only one discrepancy (UNIQUE) in the datasets, several discrepancies are reported. It is necessary to proc sql; specify which observations should be compared . update compare set unique='20120' The option that tells proc compare, which variables should be where unique='34343'; together, is ID. After it, you should list the number of variables that define which observation in each of the datasets should be update compare set weightkg=int(weightkg); compared. This is similar to the way datasets are merged using a by statement. quit; proc compare base=demog compare=compare; proc sort data=compare; id unique; by unique; run; run; proc compare data=demog # of Obs in Common: 3. compare=compare maxprint=(4,2); # of Obs in WORK.DEMOG but not in WORK.COMPARE: 1. id unique ; # of Obs in WORK.COMPARE but not in WORK.DEMOG: 1. run; Total # of Obs Read from WORK.DEMOG: 4. Total # of Obs Read from WORK.COMPARE: 4. At the most, 4 differences will be printed, 2 for every variable that
  3. 3. has discrepancies. var age; with nage; run; Variable Type Len Label Ndif MaxDif WEIGHTKG NUM 8 Weight in kg 4 0.700 # of Vars in Common: 3. _______________________________________________________ # of Vars in WORK.DEMOG but not in WORK.COMPARE: || Weight in kg 1. || Base Compare Unique || WEIGHTKG WEIGHTKG Diff. % Diff _______ || _________ _________ _________ _________ # of Vars in WORK.COMPARE but not in WORK.DEMOG: || 1. 20110 || 85.7 85.0 -0.7000 -0.8168 20120 || 101.6 101.0 -0.6000 -0.5906 # of ID Vars: 1. # of VAR Statement Vars: 1. NOTE: The MAXPRINT=2 printing limit has been reached # of WITH Statement Vars: 1. for the variable WEIGHTKG. No more values will be printed for this comparison. Sometimes, there are differences in the datasets that are not important for the purpose of the comparison. For this scenario, it is possible to give a value for which only observations that are bigger than this number will be marked as a difference. PRINTING RELATED INFORMATION TOGETHER. proc sql; update compare set nage=nage+0.001; quit; Sometimes it is desired to see the information with discrepancies grouped together by the variables in the ID statement. proc compare data=demog compare=compare proc compare data=demog criterion=.01; compare=compare var age; transpose with nage; ; run; id unique ; run; The COMPARE Procedure Unique=20110: Comparison of WORK.DEMOG with WORK.COMPARE Variable Base Value Compare Diff. % Diff (Method=RELATIVE(2.21E-12), Criterion=0.01) WEIGHTKG 85.7 85.0 -0.700000 -0.816803 dob 09/17/1949 09/14/1949 -3.000000 0.079830 Variables Summary Unique=20120: Number of Variables in Common: 3. Variable Base Value Compare Diff. % Diff Number of Variables in WORK.DEMOG but not in WEIGHTKG 101.6 101.0 -0.600000 -0.590551 WORK.COMPARE: 1. dob 08/19/1949 08/16/1949 -3.000000 0.079218 Number of Variables in WORK.COMPARE but not in WORK.DEMOG: 1. Unique=20130: Number of VAR Statement Variables: 1. Variable Base Value Compare Diff. % Diff Number of WITH Statement Variables: 1. WEIGHTKG 94.3 94.0 -0.300000 -0.318134 dob 06/02/1934 05/30/1934 -3.000000 0.032106 Number of Observations in Common: 4. Total Number of Observations Read from WORK.DEMOG: 4. Unique=20202: Total Number of Observations Read from WORK.COMPARE: Variable Base Value Compare Diff. % Diff 4. WEIGHTKG 99.3 99.0 -0.300000 -0.302115 dob 10/17/1931 10/14/1931 -3.000000 0.029118 Number of Observations with Some Compared Variables Unequal: 0. Number of Observations with All Compared Variables COMPARING VARIABLES WITH DIFFERENT Equal: 4. NAMES. Sometimes, it is necessary to compare datasets when it is known Values Comparison Summary that the variable names are different yet they should have the Number of Variables Compared with All Observations same values. Equal: 1. Number of Variables Compared with Some Observations Proc sql; Unequal: 0. Total Number of Values which Compare Unequal: 0. alter table compare Total Number of Values not EXACTLY Equal: 4. Maximum Difference Criterion Value: 0.000018868. add nage num label "Numeric age"; alter table demog add age num label "New Numeric age"; CREATING AN OUTPUT DATASET Sometimes, it is desirable to create an output dataset that update demog set contains only the equalities or discrepancies. Here, there is an age=int((date()-dob)/365.25); example to create a dataset that has the differences. The outnoequal option, specifies that only the observations with update compare set discrepancies will exist in the dataset toprint. nage=int((date()-dob)/365.25); quit; proc compare data=demog compare=compare proc compare data=demog outnoequal compare=compare; out=toprint id unique; ;
  4. 4. id unique ; Obs Unique _OBS_ _NAME_ _LABEL_ BASE COMPARE run; 1 20110 1 WEIGHTKG Weight in kg 85.7 85 3 20120 2 WEIGHTKG Weight in kg 101.6 101 Proc Print data=toprint; 5 20130 3 WEIGHTKG Weight in kg 94.3 94 7 20202 4 WEIGHTKG Weight in kg 99.3 99 run; Variables with Unequal Values Variable Type Len Label Ndif MaxDif CONCLUSION WEIGHTKG NUM 8 Weight in kg 4 0.700 You are the person who decides what is important to validate, Value Comparison Results for Variables values of the observations and how different can they be. Is it important to have the same labels, formats and informats in a _______________________________________________________ variable. Proc Compare is a tool that is flexible to allow you find ||Weight in kg the differences that are important to you. || Base Compare Unique || WEIGHTKG WEIGHTKG Diff. % Diff _________ || ________ ________ ________ _________ ACKNOWLEDGMENTS || 20110 || 85.7 85.0 -0.7000 -0.8168 I want to thank, Craig Mauldin, Andy Barnett, George Clark, 20120 || 101.6 101.0 -0.6000 -0.5906 Bonnie Duncan and Kim Sturgen for their comments and support. 20130 || 94.3 94.0 -0.3000 -0.3181 20202 || 99.3 99.0 -0.3000 -0.3021 ______________________________________________________ CONTACT INFORMATION OUTPUT OF PROC PRINT Your comments and questions are valued and encouraged. Contact the author at: Obs _TYPE_ _OBS_ Unique WEIGHTKG dob 1 DIF 1 20110 -0.7 E A. Cecilia Casas 2 DIF 2 20120 -0.6 E PPD Development 3 DIF 3 20130 -0.3 E 3900 Paramount Parkway 4 DIF 4 20202 -0.3 E Morrisville, NC 27560 It is possible to use the noprint option, but then it is strongly Phone: (919) 462-4199 recommended to use the options OUTCOMP AND OUTBASE to Fax: (919) 379-6151 differentiate the variables. Email: Cecilia.Casas@rtp.ppdi.com proc compare data=demog compare=compare SAS and all other SAS Institute Inc. product or service names are outnoequal registered trademarks or trademarks of SAS Institute Inc. in the out=toprint noprint USA and other countries. ® indicates USA registration. outcomp outbase; Other brand and product names are trademarks of their id unique ; respective companies. run; proc print data=toprint; run; Obs _TYPE_ _OBS_ Unique WEIGHTKG dob 1 BASE 1 20110 85.7 09/17/1949 2 COMPARE 1 20110 85.0 09/17/1949 3 BASE 2 20120 101.6 08/19/1949 4 COMPARE 2 20120 101.0 08/19/1949 5 BASE 3 20130 94.3 06/02/1934 6 COMPARE 3 20130 94.0 06/02/1934 7 BASE 4 20202 99.3 10/17/1931 8 COMPARE 4 20202 99.0 10/17/1931 A USEFUL DATASET TO REPORT To get only the observations and variables that have discrepancies, proc transpose can be a big help in reporting the findings. proc transpose data=toprint out=transp; by unique _obs_; id _type_; run; proc print data=transp(where=(base^=compare)); run;

×