Proc Compare to validate datasets

guestb7f80c 1,635 views 4 slides Dec 09, 2009
Slide 1
Slide 1 of 4
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4

About This Presentation

No description available for this slideshow.


Slide Content

Proc Compare to Validate Datasets
Angelina Cecilia Casas M.S., PPD Development, Research Triangle Park, NC
ABSTRACT
Comparison of two datasets is a technique used to know that two
datasets are equal or if they have discrepancies. This method
can be used to validate that a dataset has been created correctly
or that the changes in a dataset are only those that are expected.
I.e. observations were added or removed or corrections were
done.
INTRODUCTION
Proc Compare is a procedure that allows two datasets to be
compared for properties, number of observations, number of
variables, and properties of the datasets.
For a variable, you can get output about differences in:
Type, length, formats, informats and label(s).
For a dataset, you can find differences in:
date of creation, last modification of the datasets was modified,
number of variables and observations of the datasets. You can
also see the labels of the datasets, but differences are not
reported for that.
For observations, you can compare the value of the record for
each variable. Also, you can decide how different the values of
the observations can be.
THE DATASETS THAT WILL BE COMPARED.
proc print data=demog noobs;
Unique WEIGHTKG dob
20120 101.6 08/19/1949
20130 94.3 06/02/1934
20110 85.7 09/17/1949
20202 99.3 10/17/1931
proc contents data=demog;
Data Set Name: WORK.DEMOG Observations: 4
Member Type: DATA Variables: 3
--Alphabetic List of Variables and Attributes--
Variable Type Len Pos Format Label
-----------------------------------------------
UNIQUE Char 200 32 Unique Record
WEIGHTKG Num 8 8 5.1 Weight in kg
dob Num 8 24 MMDDYY10. Date of Birth
proc sql;
create table Compare as select * from demog;
quit;
AN EXAMPLE OF A CLEAN COMPARE OUTPUT
proc compare base=here.demog compare=compare;
run;
The COMPARE Procedure
Comparison of WORK.DEMOG with WORK.COMPARE
(Method=EXACT)
Data Set Summary
Dataset Created Modified Nvar Nobs
WORK.DEMOG 14JAN03:16:03 14JAN03:16:06 34
WORK.COMPARE 14JAN03:16:06 14JAN03:16:06 34
Variables Summary
Number of Variables in Common: 3.
Observation Summary
Observation Base Compare
First Obs 1 1
Last Obs 4 4
Number of Observations in Common: 4.
Total Number of Observations Read from
WORK.DEMOG: 4.
Total Number of Observations Read from WORK.COMPARE: 4.
Number of Observations with Some Compared Variables
Unequal: 0.
Number of Observations with All Compared Variables Equal: 4.
NOTE: No unequal values were found. All values
compared are exactly equal.
The records in DEMOG and the records in compare have the
same order and an ID statement is not needed. However if the
order of the two datasets is not the same, records might be
compared incorrectly.
WHEN THE DATASETS HAVE A DIFFERENT
ORDER;
proc sort data=compare;
by DOB;
run;
proc compare base=demog (keep=unique weightkg)
compare=compare (keep=unique weightkg);
run;
As in other SAS procedures and data step, it is possible to select
observations and variables in the datasets that are going to be
used.
Dataset Created Modified NVar Nobs
WORK.DEMOG 20JAN03:13:17 20JAN03:13:17 2 4
WORK.COMPARE 20JAN03:13:17 20JAN03:13:17 2 4
Vars Summary
#ofVarsinCommon:2.
Observation Summary
Observation Base Compare
# of Obs in Common: 4.
Total # of Obs Read from WORK.DEMOG: 4.
Total # of Obs Read from WORK.COMPARE: 4.
# of Obs with Some Compared Vars Unequal: 3.
# of Obs with All Compared Vars Equal: 1.
Values Comparison Summary
# of Vars Compared with All Obs Equal: 0.
# of Vars Compared with Some Obs Unequal: 2.
Total # of Values which Compare Unequal: 6.
Maximum Difference: 15.9.
All Vars Compared have Unequal Values
Variable Type Len Label Ndif MaxDif
Unique CHAR 200 Unique Record 3
WEIGHTKG NUM 8 Weight in kg 3 15.900
Value Comparison Results for Vars

______________________________________________________
|| Unique Record
|| Base Value Compare Value
Obs || Unique Unique
_____ || ___________________+ ___________________+
||
1 || 20120 20202
3 || 20110 20120
4 || 20202 20110
_______________________________________________________
_______________________________________________________
|| Weight in kg
|| Base Compare
Obs || WEIGHTKG WEIGHTKG Diff. % Diff
_____ || _________ _________ _________ _________
||
1 || 101.6 99.3 -2.3000 -2.2638
3 || 85.7 101.6 15.9000 18.5531
4 || 99.3 85.7 -13.6000 -13.6959
_______________________________________________________
The values of the variable that was to the right of the BASE or
DATA statement is in the BASE column. The values of dataset
listed to right of the COMPARE dataset is in the COMPARE
column.
It is necessary that the two datasets that are going to be
compared have the same order UNLESS the order of the
datasets is something that is being tested.
What happens when the order of the observations in each of the
datasets is different?
proc sql;
update compare set unique='34343' where
unique='20202';
quit;
proc sort data=compare;
by unique;
run;
proc compare base=demog(keep = unique weightkg)
compare=compare(keep = unique weightkg);
run;
_______________________________________________________
|| Unique Record
|| Base Value Compare Value
Obs || Unique Unique
_______ || ___________________+ ___________________+
||
2 || 20120 20130
3 || 20130 20202
4 || 20202 34343
_______________________________________________________
Etcetera
Even though, there is only one discrepancy (UNIQUE) in the
datasets, several discrepancies are reported. It is necessary to
specify which observations should be compared .
The option that tells proc compare, which variables should be
together, is ID. After it, you should list the number of variables
that define which observation in each of the datasets should be
compared. This is similar to the way datasets are merged using
a by statement.
proc compare base=demog compare=compare;
id unique;
run;
# of Obs in Common: 3.
# of Obs in WORK.DEMOG but not in WORK.COMPARE: 1.
# of Obs in WORK.COMPARE but not in WORK.DEMOG: 1.
Total # of Obs Read from WORK.DEMOG: 4.
Total # of Obs Read from WORK.COMPARE: 4.
# of Obs with Some Compared Variables Unequal: 0.
# of Obs with All Compared Variables Equal: 3.
NOTE: No unequal values were found. All values compared
are exactly equal.
INFORMATION ABOUT VARIABLES OR
OBSERVATIONS THAT ARE IN DIFFERENT IN
THE DATASETS BEING COMPARED.
The output there are no differences, but it might be easy to miss
the fact that only 3 of the 4 observations were compared. Also,
you don’t know which observations were not compared unless
you use the LIST or LISTALL option.
proc compare data=base
compare=compare
list;
id unique;
run;
Observation 2 in WORK.DEMOG not found in WORK.COMPARE:
Unique=20120.
Observation 4 in WORK.COMPARE not found in WORK.DEMOG: Unique=34343.
The LIST option will list the observations and variables that exist
in one dataset and are missing in the other. There are other
options, LISTALL, LISTBASE, LISTBASEOBS, LISTBASEVAR,
LISTCOMP, LISTCMPOBS, LISTCOMPVAR, LISTOBS, AND
LISTVAR, to limit this report to one or the other dataset and only
variables and / or observations. LISTEQUALVAR will also list
the variables that are not used in the ID variable list for which all
values are equal.
LIMITING THE PRINTED OUTPUT
Sometimes it is not necessary to know all the observations that
have an error, once it is known that an error exists, it might not be
needed to know all the discrepancies.
It might be desirable to limit the size of the output; for that, the
option MAXPRINT is very useful.
First it is necessary to use the original dataset and modify the
variable WEIGHTKG to have something to report.
For example, let’s change the value of one of the variables in the
ID statement from ‘34343’ to ‘20120’.
proc sql;
update compare set unique='20120'
where unique='34343';
update compare set weightkg=int(weightkg);
quit;
proc sort data=compare;
by unique;
run;
proc compare data=demog
compare=compare
maxprint=(4,2);
id unique ;
run;
At the most, 4 differences will be printed, 2 for every variable that

has discrepancies.
Variable Type Len Label Ndif MaxDif
WEIGHTKG NUM 8 Weight in kg 4 0.700
_______________________________________________________
|| Weight in kg
|| Base Compare
Unique || WEIGHTKG WEIGHTKG Diff. % Diff
_______ || _________ _________ _________ _________
||
20110 || 85.7 85.0 -0.7000 -0.8168
20120 || 101.6 101.0 -0.6000 -0.5906
NOTE: The MAXPRINT=2 printing limit has been reached
for the variable WEIGHTKG.
No more values will be printed for this comparison.
PRINTING RELATED INFORMATION
TOGETHER.
Sometimes it is desired to see the information with discrepancies
grouped together by the variables in the ID statement.
proc compare data=demog
compare=compare
transpose
;
id unique ;
run;
Unique=20110:
Variable Base Value Compare Diff. % Diff
WEIGHTKG 85.7 85.0 -0.700000 -0.816803
dob 09/17/1949 09/14/1949 -3.000000 0.079830
Unique=20120:
Variable Base Value Compare Diff. % Diff
WEIGHTKG 101.6 101.0 -0.600000 -0.590551
dob 08/19/1949 08/16/1949 -3.000000 0.079218
Unique=20130:
Variable Base Value Compare Diff. % Diff
WEIGHTKG 94.3 94.0 -0.300000 -0.318134
dob 06/02/1934 05/30/1934 -3.000000 0.032106
Unique=20202:
Variable Base Value Compare Diff. % Diff
WEIGHTKG 99.3 99.0 -0.300000 -0.302115
dob 10/17/1931 10/14/1931 -3.000000 0.029118
COMPARING VARIABLES WITH DIFFERENT
NAMES.
Sometimes, it is necessary to compare datasets when it is known
that the variable names are different yet they should have the
same values.
Proc sql;
alter table compare
add nage num label "Numeric age";
alter table demog
add age num label "New Numeric age";
update demog set
age=int((date()-dob)/365.25);
update compare set
nage=int((date()-dob)/365.25);
quit;
proc compare data=demog
compare=compare;
id unique;
var age;
with nage;
run;
# of Vars in Common: 3.
# of Vars in WORK.DEMOG but not in WORK.COMPARE:
1.
# of Vars in WORK.COMPARE but not in WORK.DEMOG: 1.
# of ID Vars: 1.
# of VAR Statement Vars: 1.
# of WITH Statement Vars: 1.
Sometimes, there are differences in the datasets that are not
important for the purpose of the comparison. For this scenario, it
is possible to give a value for which only observations that are
bigger than this number will be marked as a difference.
proc sql;
update compare set nage=nage+0.001;
quit;
proc compare data=demog
compare=compare
criterion=.01;
var age;
with nage;
run;
The COMPARE Procedure
Comparison of WORK.DEMOG with WORK.COMPARE
(Method=RELATIVE(2.21E-12), Criterion=0.01)
Variables Summary
Number of Variables in Common: 3.
Number of Variables in WORK.DEMOG but not in
WORK.COMPARE: 1.
Number of Variables in WORK.COMPARE but not in
WORK.DEMOG: 1.
Number of VAR Statement Variables: 1.
Number of WITH Statement Variables: 1.
Number of Observations in Common: 4.
Total Number of Observations Read from WORK.DEMOG: 4.
Total Number of Observations Read from WORK.COMPARE:
4.
Number of Observations with Some Compared Variables
Unequal: 0.
Number of Observations with All Compared Variables
Equal: 4.
Values Comparison Summary
Number of Variables Compared with All Observations
Equal: 1.
Number of Variables Compared with Some Observations
Unequal: 0.
Total Number of Values which Compare Unequal: 0.
Total Number of Values not EXACTLY Equal: 4.
Maximum Difference Criterion Value: 0.000018868.
CREATING AN OUTPUT DATASET
Sometimes, it is desirable to create an output dataset that
contains only the equalities or discrepancies. Here, there is an
example to create a dataset that has the differences.
Theoutnoequal option, specifies that only the observations with
discrepancies will exist in the dataset toprint.
proc compare data=demog
compare=compare
outnoequal
out=toprint
;

id unique ;
run;
Proc Print data=toprint;
run;
Variables with Unequal Values
Variable Type Len Label Ndif MaxDif
WEIGHTKG NUM 8 Weight in kg 4 0.700
Value Comparison Results for Variables
_______________________________________________________
||Weight in kg
|| Base Compare
Unique || WEIGHTKG WEIGHTKG Diff. % Diff
_________ || ________ ________ ________ _________
||
20110 || 85.7 85.0 -0.7000 -0.8168
20120 || 101.6 101.0 -0.6000 -0.5906
20130 || 94.3 94.0 -0.3000 -0.3181
20202 || 99.3 99.0 -0.3000 -0.3021
______________________________________________________
OUTPUT OF PROC PRINT
Obs _TYPE_ _OBS_ Unique WEIGHTKG dob
1 DIF 1 20110 -0.7 E
2 DIF 2 20120 -0.6 E
3 DIF 3 20130 -0.3 E
4 DIF 4 20202 -0.3 E
It is possible to use the noprint option, but then it is strongly
recommended to use the options OUTCOMP AND OUTBASE to
differentiate the variables.
proc compare data=demog
compare=compare
outnoequal
out=toprint
noprint
outcomp
outbase;
id unique ;
run;
proc print data=toprint;
run;
Obs _TYPE_ _OBS_ Unique WEIGHTKG dob
1 BASE 1 20110 85.7 09/17/1949
2 COMPARE 1 20110 85.0 09/17/1949
3 BASE 2 20120 101.6 08/19/1949
4 COMPARE 2 20120 101.0 08/19/1949
5 BASE 3 20130 94.3 06/02/1934
6 COMPARE 3 20130 94.0 06/02/1934
7 BASE 4 20202 99.3 10/17/1931
8 COMPARE 4 20202 99.0 10/17/1931
A USEFUL DATASET TO REPORT
To get only the observations and variables that have
discrepancies, proc transpose can be a big help in reporting the
findings.
proc transpose data=toprint out=transp;
by unique _obs_;
id _type_;
run;
proc print data=transp(where=(base^=compare));
run;
Obs Unique _OBS_ _NAME_ _LABEL_ BASE COMPARE
1 20110 1 WEIGHTKG Weight in kg 85.7 85
3 20120 2 WEIGHTKG Weight in kg 101.6 101
5 20130 3 WEIGHTKG Weight in kg 94.3 94
7 20202 4 WEIGHTKG Weight in kg 99.3 99
CONCLUSION
You are the person who decides what is important to validate,
values of the observations and how different can they be. Is it
important to have the same labels, formats and informats in a
variable. Proc Compare is a tool that is flexible to allow you find
the differences that are important to you.
ACKNOWLEDGMENTS
I want to thank, Craig Mauldin, Andy Barnett, George Clark,
Bonnie Duncan and Kim Sturgen for their comments and support.
CONTACT INFORMATION
Your comments and questions are valued and encouraged.
Contact the author at:
A. Cecilia Casas
PPD Development
3900 Paramount Parkway
Morrisville, NC 27560
Phone: (919) 462-4199
Fax: (919) 379-6151
Email: [email protected]
SAS and all other SAS Institute Inc. product or service names are
registered trademarks or trademarks of SAS Institute Inc. in the
USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their
respective companies.
Tags