Comparing SAS Files

LauraASchild 735 views 41 slides Jul 11, 2015
Slide 1
Slide 1 of 41
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34
Slide 35
35
Slide 36
36
Slide 37
37
Slide 38
38
Slide 39
39
Slide 40
40
Slide 41
41

About This Presentation

No description available for this slideshow.


Slide Content

Comparing SAS Files Laura Schild

Consider using PROC COMPARE the next time you need to… prepare to combine two data sets, so you know what variables need to be reformatted, etc. evaluate newly collected data in comparison to an existing file (ex. CRANE Study / Healthcore files) test whether program revisions have occurred as expected examine whether two algorithms for computing certain variables produce comparable results

PROC COMPARE with No Options proc compare base = TrackingSystem_BR_201112 compare = TrackingSystem_BR_201201 ; title1 'PROC COMPARE with No Options' ; run ;

PROC COMPARE with No Options (continued)

PROC COMPARE with No Options (continued)

PROC COMPARE with No Options (continued)

PROC COMPARE with No Options (continued)

PROC COMPARE with No Options (continued)

Comparing Contents Only When all you really care about is the contents, add the NoValues and ListVar options. NOVALUES suppresses the report of the value comparison results. NOSUMMARY suppresses the data set, variable, observation, and values comparison summary reports. LISTVAR lists all variables that are found in only one data set. WARNING displays a warning message in the SAS log when differences are found. proc compare novalues nosummary listvar warning base = TrackingSystem_BR_201112 compare = TrackingSystem_BR_201201 ; title1 'PROC COMPARE with NoValues , NoSummary and ListVar Options' ; run ;

Comparing Contents Only (continued)

Comparing Contents Only (continued)

Comparing Contents Only (continued)

The Warning Option Note: The NoValues and NoSummary options suppress printing reports, but SAS still compares the records. Consequently, when you use the Warning option, you will still get warnings even if the contents are identical.

PROC COMPARE Identical Files with Warning Option *-------------------------------------------------------------------*; * PROC COMPARE Identical Files with Warning Option *; *-------------------------------------------------------------------*; data TrackingSystem_BR_201112_Copy; set TrackingSystem_BR_201112; run ; options pageno = 1 ; proc compare warning base = TrackingSystem_BR_201112 compare = TrackingSystem_BR_201112_Copy ; title1 'PROC COMPARE Identical Files with Warning Option' ; run ;

PROC COMPARE Identical Files with Warning Option (output)

PROC COMPARE Identical Files Sorted Differently *-------------------------------------------------------------------*; * PROC COMPARE with Warning Option *; * - Identical Files Sorted Differently *; *-------------------------------------------------------------------*; proc sort data =TrackingSystem_BR_201112; by MemberNumber ; run ; proc sort data =TrackingSystem_BR_201112_copy; by LastName ; run ; options pageno = 1 ; proc compare warning base = TrackingSystem_BR_201112 compare = TrackingSystem_BR_201112_copy ; title1 'PROC COMPARE with Warning Option - Identical Files Sorted Differently' ; run ;

PROC COMPARE Identical Files Sorted Differently (Log) 56 57 options pageno =1; 58 proc compare warning 59 base = TrackingSystem_BR_201112 60 compare = TrackingSystem_BR_201112_copy 61 ; 62 title1 'PROC COMPARE with Warning Option - Identical Files Sorted Differently'; 63 run; WARNING: Values of the following 18 variables compare unequal: MemberNumber FirstName LastName PhoneNumber Gender BirthDate StreetAddress StreetAddress2 City ZipCode NumberOfMedications_BR DrugName1_BR DrugName2_BR PrescriptionNumber1_BR PrescriptionNumber2_BR TransferNumberLive Schedule TransferNumberScript WARNING: The data sets WORK.TRACKINGSYSTEM_BR_201112 and WORK.TRACKINGSYSTEM_BR_201112_COPY contain unequal values. NOTE: There were 90 observations read from the data set WORK.TRACKINGSYSTEM_BR_201112. NOTE: There were 90 observations read from the data set WORK.TRACKINGSYSTEM_BR_201112_COPY. NOTE: At least one W.D format was too small for the number to be printed. The decimal may be shifted by the "BEST" format. NOTE: PROCEDURE COMPARE used (Total process time): real time 0.01 seconds cpu time 0.01 seconds

PROC COMPARE Identical Files Sorted Differently (Output)

PROC COMPARE Identical Files Sorted Differently ( Output, page 2)

PROC COMPARE Identical Files Sorted Differently (Output, page 19) For each variable, you get a list similar to the following. Full output not included due to patient confidentiality and due to length.

PROC COMPARE by ID With MaxPrint Option (code) *-------------------------------------------------------------------------------*; * PROC COMPARE by ID with ListObs and MaxPrint Options *; *-------------------------------------------------------------------------------*; libname local 'G:\Investigators\Davis\Davis_CRANE (SJS Antiepileptic)\SAS Programs\Data' ; libname kpga 'G:\Investigators\Davis\Davis_CRANE (SJS Antiepileptic)\SAS Programs\Data\KPGA' ; proc sort data =kpga.CRANE01v3_KPGA_Cohort_NoID; by StudyID Cohort; proc sort data =local.CRANE01v5_KPGA_Cohort_NoID; by StudyID Cohort; run ; options pageno = 1 ; PROC COMPARE ListVar ListObs MaxPrint =( 10 , 500 ) BASE = kpga.CRANE01v3_KPGA_Cohort_NoID COMPARE = local.CRANE01v5_KPGA_Cohort_NoID LISTOBS ; ID StudyID Cohort; title1 'PROC COMPARE by ID with ListVar , ListObs , and MaxPrint Options' ; RUN ;

MaxPrint Option LISTOBS lists all observations that are found in only one data set. MAXPRINT= total | ( per-variable, total ) specifies the maximum number of differences to print, where total is the maximum total number of differences to print. The default value is 500 unless you use the ALLOBS option (or both the ALLVAR and TRANSPOSE options), in which case the default is 32000. per-variable is the maximum number of differences to print for each variable within a BY group. The default value is 50 unless you use the ALLOBS option (or both the ALLVAR and TRANSPOSE options), in which case the default is 1000. The MAXPRINT= option prevents the output from becoming extremely large when data sets differ greatly.

PROC COMPARE by ID With MaxPrint Option (output)

Comparing Multiple Files

Run PROC CONTENTS Run a PROC CONTENTS on each file to be included in the comparison, keeping the variable name, type, and length. *-------------------------------------------------------------------------------*; * Compare the variable names/formats on each of the Tracking System BR files. *; *-------------------------------------------------------------------------------*; proc contents data =qa.TrackingSystem_BR_201112 noprint out =TS_BR_201112 (keep=NAME TYPE LENGTH); proc contents data =qa.TrackingSystem_BR_201201 noprint out =TS_BR_201201 (keep=NAME TYPE LENGTH); proc contents data =qa.TrackingSystem_BR_201202 noprint out =TS_BR_201202 (keep=NAME TYPE LENGTH); proc contents data =qa.TrackingSystem_BR_201203 noprint out =TS_BR_201203 (keep=NAME TYPE LENGTH); proc contents data =qa.TrackingSystem_BR_201204 noprint out =TS_BR_201204 (keep=NAME TYPE LENGTH); proc contents data =qa.TrackingSystem_BR_201205 noprint out =TS_BR_201205 (keep=NAME TYPE LENGTH); proc contents data =qa.TrackingSystem_BR_201206 noprint out =TS_BR_201206 (keep=NAME TYPE LENGTH); proc contents data =qa.TrackingSystem_BR_201207 noprint out =TS_BR_201207 (keep=NAME TYPE LENGTH); proc contents data =qa.TrackingSystem_BR_201208 noprint out =TS_BR_201208 (keep=NAME TYPE LENGTH); proc contents data =qa.TrackingSystem_BR_201209 noprint out =TS_BR_201209 (keep=NAME TYPE LENGTH); proc contents data =qa.TrackingSystem_BR_201210 noprint out =TS_BR_201210 (keep=NAME TYPE LENGTH); proc contents data =qa.TrackingSystem_BR_201211 noprint out =TS_BR_201211 (keep=NAME TYPE LENGTH); run ;

Get All Variable Names Combine all of the files, keeping only the variable name, and then eliminate duplicates. data TS_BR_AllNames (keep=name); set TS_BR_201112 TS_BR_201201 TS_BR_201202 TS_BR_201203 TS_BR_201204 TS_BR_201205 TS_BR_201206 TS_BR_201207 TS_BR_201208 TS_BR_201209 TS_BR_201210 TS_BR_201211 ; run ; proc sort data = TS_BR_AllNames nodupkey ; by name; run ;

Combine All Contents Using SQL proc sql ; create table TS_BR_AllContents as select a.name , b.type as type_201112 , c.type as type_201201 , d.type as type_201202 , e.type as type_201203 , f.type as type_201204 , g.type as type_201205 , h.type as type_201206 , i.type as type_201207 , j.type as type_201208 , k.type as type_201209 , l.type as type_201210 , m.type as type_201211 , b.length as length_201112 , c.length as length_201201 , d.length as length_201202 , e.length as length_201203 , f.length as length_201204 , g.length as length_201205 , h.length as length_201206

Combine All Contents Using SQL , i.length as length_201207 , j.length as length_201208 , k.length as length_201209 , l.length as length_201210 , m.length as length_201211 from TS_BR_AllNames as a left join TS_BR_201112 as b on a.name = b.name left join TS_BR_201201 as c on a.name = c.name left join TS_BR_201202 as d on a.name = d.name left join TS_BR_201203 as e on a.name = e.name left join TS_BR_201204 as f on a.name = f.name left join TS_BR_201205 as g on a.name = g.name left join TS_BR_201206 as h on a.name = h.name left join TS_BR_201207 as i on a.name = i.name left join TS_BR_201208 as j on a.name = j.name left join TS_BR_201209 as k on a.name = k.name left join TS_BR_201210 as l on a.name = l.name left join TS_BR_201211 as m on a.name = m.name ; quit ;

Report Variables Not On All Files proc print data = TS_BR_AllContents ; where type_201112 = . or type_201201 = . or type_201202 = . or type_201203 = . or type_201204 = . or type_201205 = . or type_201206 = . or type_201207 = . or type_201208 = . or type_201209 = . or type_201210 = . or type_201211 = . ; title3 'Variable Name Not On All Tracking System BR Files' ; run ;

Report Variables Not On All Files

Correct Discrepancies If one or more files contain variables that they should not have (e.g. HRN on the 201201 file), drop those variables. If one or more files are missing variables that they should have, add them, by: linking to another source (if available) hardcoding a value adding dummy variables w/ missing values getting a replacement file if necessary If there are discrepancies in variable names, you might see both “extra” variables and “missing” variables that need to be fixed by renaming them. M ake corrections and rerun the PROC CONTENTS and PROC SQL. You might also want to rerun the discrepancy report to make sure you fixed the discrepancies correctly.

Report Variable Type Discrepancies *-------------------------------------------------------------------------------*; * Check variable types for consistency. *; *-------------------------------------------------------------------------------*; proc print data = TS_BR_AllContents ; where type_201112 ne type_201211 or type_201201 ne type_201211 or type_201202 ne type_201211 or type_201203 ne type_201211 or type_201204 ne type_201211 or type_201205 ne type_201211 or type_201206 ne type_201211 or type_201207 ne type_201211 or type_201208 ne type_201211 or type_201209 ne type_201211 or type_201210 ne type_201211 ; title3 'Variable Type Discrepancy On Tracking System BR Files' ; run ;

Report Variable Type Discrepancies

Correct Type Discrepancies Converting Character to Numeric %macro CorrectType ( file,var,tempvar ); data TrackingSystem_BR_&file (drop=& tempvar ); set TrackingSystem_BR_&file (rename=(& var =& tempvar )); & var = & tempvar * 1 ; run; %mend CorrectType ; % CorrectType ( 201201 ,PrescriptionNumber2_BR,RxNum2_BR_Char); % CorrectType ( 201203 ,PrescriptionNumber2_BR,RxNum2_BR_Char); % CorrectType ( 201204 ,PrescriptionNumber2_BR,RxNum2_BR_Char); Then rerun the PROC CONTENTS and PROC SQL. You might also want to rerun the discrepancy reports to make sure you fixed the type discrepancies correctly.

Report Variable Length Discrepancies *-------------------------------------------------------------------------------*; * Check variable lengths for consistency. *; *-------------------------------------------------------------------------------*; data TS_BR_AllContents ; set TS_BR_AllContents ; length_max = max(length_201112,length_201201,length_201202,length_201203, length_201204,length_201205,length_201206,length_201207, length_201208,length_201209,length_201210,length_201211); run ; proc print data = TS_BR_AllContents ; where length_201112 ne length_201211 or length_201201 ne length_201211 or length_201202 ne length_201211 or length_201203 ne length_201211 or length_201204 ne length_201211 or length_201205 ne length_201211 or length_201206 ne length_201211 or length_201207 ne length_201211 or length_201208 ne length_201211 or length_201209 ne length_201211 or length_201210 ne length_201211 ; var name length_max length_2: type: ; title3 'Variable Length Discrepancy On Tracking System BR Files' ; run ;

Report Variable Length Discrepancies

Combine Files Keeping Longest Variable Lengths *-------------------------------------------------------------------------------*; * Combine all of the Tracking System BR files. *; *-------------------------------------------------------------------------------*; data TrackingSystem_BR_All ; length City $ 16 DrugName2_BR $ 21 FirstName $ 11 LastName $ 17 PRODUCTID $ 14 PreviouslyAsked $ 4 StreetAddress $ 30 StreetAddress2 $ 28 ; set TrackingSystem_BR_201112 (in=a) TrackingSystem_BR_201201 (in=b) TrackingSystem_BR_201202 (in=c) TrackingSystem_BR_201203 (in=d) TrackingSystem_BR_201204 (in=e) TrackingSystem_BR_201205 (in=f) TrackingSystem_BR_201206 (in=g) TrackingSystem_BR_201207 (in=h) TrackingSystem_BR_201208 (in= i ) TrackingSystem_BR_201209 (in=j) TrackingSystem_BR_201210 (in=k) TrackingSystem_BR_201211 (in=l) ;
Tags