
Data Match Merging in SAS

Learning
Base SAS,
Advanced SAS,
Proc SQL,
ODS,
SAS in financial industry,
Clinical trials,
SAS Macros,
SAS BI,
SAS on Unix,
SAS on Mainframe,
SAS interview Questions and Answers,
SAS Tips and Techniques,
SAS Resources,
SAS Certification questions...

visit http://sastechies.blogspot.com

Published in: Technology, Business

  1. 1. SASTechies [email_address] http://www.sastechies.com
  2. 2. SAS Techies 2009, 11/13/09

          data finance.duejan;
             set finance.loans;
             Interest=amount*(rate/12);
          run;

      SAS Data Set Finance.Loans

          Account    Amount    Rate   Months   Payment
          101-1092    22000  0.1000       60    467.43
          101-1731   114000  0.0950      360    958.57
          101-1289    10000  0.1050       36    325.02
          101-3144     3500  0.1050       12    308.52
  3. 3. Each time the SET statement is executed, SAS reads one observation into the program data vector. SET reads all variables and all observations from the input data sets unless you tell SAS to do otherwise. A SET statement can contain multiple data sets; a DATA step can contain multiple SET statements.

          SET <SAS-data-set(s) <(data-set-option(s))>> <options>;
  4. 4.

          /* subsetting IF statement */
          data lab23.drug1h;
             set research.cltrials;
             if placebo='YES';
          run;

          /* WHERE statement */
          data lab23.drug1h;
             set research.cltrials;
             where placebo='YES';
          run;

          /* WHERE= data set option */
          data lab23.drug1h;
             set research.cltrials(where=(placebo='YES'));
          run;

          /* concatenate data sets A and C */
          data lab23.drug1h;
             set A C;
          run;
  5. 5.

          /* Works: placebo is read, used by the IF, then dropped on output */
          data lab23.drug1h(drop=placebo);
             set research.cltrials(drop=triglycerides uricacid);
             if placebo='YES';
          run;

          /* Does not work as intended: placebo is dropped on input,
             so the IF cannot see its values */
          data lab23.drug1h(drop=placebo);
             set research.cltrials(drop=triglycerides uricacid placebo);
             if placebo='YES';
          run;

      - If you don't process certain variables and you don't want them to appear in the new data set, specify them in the DROP= option in the SET statement.
      - If you do need to process a variable from the original data set (in a subsetting IF statement, for example), specify the variable in the DROP= option in the DATA statement instead. Otherwise the variable is not read into the program data vector, and the statement that uses it cannot work as intended.
  6. 6.

          proc sort data=a; by num; run;
          proc sort data=b; by num; run;

          /* match-merge a and b by num */
          data sharad;
             merge a b;
             by num;
          run;

          /* concatenate a and b */
          data sharad;
             set a b;
          run;
  7. 7. The DATA step provides a large number of other programming features for manipulating data sets. For example, you can

      - use IF-THEN/ELSE logic to control processing based on one or more conditions
      - specify additional data set options
      - perform calculations
      - create new variables
      - process variables in arrays
      - use SAS functions
      - use special variables such as FIRST.variable and LAST.variable to control processing.

      You can also combine SAS data sets in other ways, including match merging, interleaving, one-to-one merging, and updating.
  8. 8.

          DATA output-SAS-data-set;
             MERGE SAS-data-set-1 SAS-data-set-2;
             BY variable(s);
          RUN;

      - A match-merge produces an output data set that contains values from all observations in all input data sets.
      - You can specify any number of input data sets in the MERGE statement.
      - In DATA step match-merging, all data sets to be merged must be sorted or indexed by the values of the BY variable(s).
      - The common variable must have the same type and length in all data sets to be merged.
  9. 9.

          PROC SORT <DATA=SAS-data-set> <OUT=SAS-data-set> <options>;
             BY variable(s);
          RUN;

      Interesting options:

      - NODUPKEY
      - NODUPRECS
      - WHERE statement

      Note: If you don't use the OUT= option, PROC SORT permanently sorts the data set specified in the DATA= option.

          Obs  ID    Age  Sex  Date
          1    A001  21   m    05/22/75
          2    A001  21   m    05/22/75
          3    A003  24   f    08/17/72
          4    A004  .         03/27/69
          5    A005  44   f    02/24/52
          6    A007  39   m    11/11/57

          Obs  ID    Age  Sex  Date
          1    A001  21   m    05/22/75
          2    A001  32   m    06/15/63
          3    A003  24   f    08/17/72
          4    A004  .         03/27/69
          5    A005  44   f    02/24/52
          6    A007  39   m    11/11/57
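
As a minimal sketch of the OUT= and NODUPKEY options listed above: the data set names `work.patients` and `work.patients_unique` are hypothetical, and the values are loosely based on the second table on the slide. OUT= leaves the input data set untouched, and NODUPKEY keeps one observation per value of the BY variable.

```sas
/* Hypothetical input: ID A001 appears twice with different Age values */
data work.patients;
   input ID $ Age Sex $;
   datalines;
A003 24 f
A001 21 m
A001 32 m
A005 44 f
;
run;

/* OUT= writes the sorted result to a new data set, so work.patients
   keeps its original order. NODUPKEY drops observations whose BY value
   repeats an earlier one. */
proc sort data=work.patients out=work.patients_unique nodupkey;
   by ID;
run;

proc print data=work.patients_unique;
run;
```

The printed data set should contain three observations, one each for A001, A003 and A005. Without NODUPKEY, both A001 observations would be kept.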
  10. 10.

          data clinic.combined;
             merge clinic.demog(rename=(date=BirthDate))
                   clinic.visit(rename=(date=VisitDate));
             by id;
             if BirthDate='05Mar2005'd;
             rename BirthDate=somedate;
          run;

      Note: when you rename a variable with the RENAME= data set option on an input data set, use the new name in the rest of that DATA step.

          (RENAME=(old-variable-name=new-variable-name))

      where

      - the RENAME= option, in parentheses, follows the name of each data set that contains one or more variables to be renamed
      - old-variable-name names the variable to be renamed
      - new-variable-name specifies the new name for the variable.

      You can rename any number of variables in each occurrence of the RENAME= option.
  11. 11.

          data combined;
             merge clients(in=A) Amounts(in=B);
             by Name;
             if A and B;
          run;

      Note: If the expression is true for the observation, the current observation is written to the output data set.

          (IN=variable)

      where

      - the IN= option, in parentheses, follows the data set name
      - variable names the variable to be created.

      Use

      - the IN= data set option to create and name a variable that indicates whether the data set contributed data to the current observation
      - the subsetting IF statement to check the IN= values and output only those observations that appear in the data sets for which IN= is specified.
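
The IN= variables can also separate matched from unmatched observations. The sketch below is illustrative only: the data sets `clients` and `amounts` echo the names on the slide, but their contents and the three output data set names are made up for this example.

```sas
/* Hypothetical inputs, already entered in Name order */
data clients;
   input Name $ City $;
   datalines;
Ann Boston
Bob Denver
Cy Austin
;
run;

data amounts;
   input Name $ Amount;
   datalines;
Ann 100
Cy 250
Dee 75
;
run;

/* A=1 when clients contributed to the current observation,
   B=1 when amounts contributed. IN= variables are temporary
   and are not written to the output data sets. */
data both only_clients only_amounts;
   merge clients(in=A) amounts(in=B);
   by Name;
   if A and B then output both;          /* Name found in both inputs */
   else if A then output only_clients;   /* Name found only in clients */
   else output only_amounts;             /* Name found only in amounts */
run;
```

With this input, `both` receives Ann and Cy, `only_clients` receives Bob, and `only_amounts` receives Dee.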
  12. 12. The Compilation Phase: Setting Up the New Data Set

      To prepare to merge data sets, SAS software

      - reads the descriptor portions of data sets listed in the MERGE statement
      - reads the remainder of the DATA step program
      - creates the program data vector (PDV)
      - assigns a tracking pointer to each data set listed in the MERGE statement.

      If variables with the same name appear in more than one data set, the variable from the first data set that contains the variable (in the order listed in the MERGE statement) determines the length of the variable.
  13. 13. The Execution Phase:

      After compiling the DATA step, SAS software sequentially match-merges observations by moving the pointers down each observation of each data set and checking to see whether the BY values match.

      - If Yes, the observations are written to the PDV in the order the data sets appear in the MERGE statement. (Remember that values of any like-named variable are overwritten by values of the like-named variable in subsequent data sets.) SAS software writes the combined observation to the new data set and retains the values in the PDV until the BY value changes in all the data sets.
  14. 14.

      - If No, SAS software determines which of the values comes first and writes the observation containing this value to the PDV. Then the observation is written to the new data set.
  15. 15.

      - When the BY value changes in all the input data sets, the PDV is initialized to missing.
      - The DATA step merge continues to process every observation in each data set until it exhausts all observations in all data sets.
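
Because values stay in the PDV until the BY value changes, a one-to-many merge repeats the "one" side on every matching observation. A minimal sketch, using hypothetical data sets `customers` and `orders` that are not part of the original slides:

```sas
/* One observation per Name */
data customers;
   input Name $ City $;
   datalines;
Ann Boston
Bob Denver
;
run;

/* Several observations per Name, already in Name order */
data orders;
   input Name $ Amount;
   datalines;
Ann 100
Ann 40
Bob 75
;
run;

/* City is read once per BY group and retained in the PDV,
   so it appears on every order for that Name */
data customer_orders;
   merge customers orders;
   by Name;
run;

proc print data=customer_orders;
run;
```

The result has three observations: Ann/Boston/100, Ann/Boston/40 and Bob/Denver/75.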
  16. 16. Handling Unmatched Observations and Missing Values

      - By default, all observations written to the PDV, including observations with missing data and no matching BY values, are written to the output data set. (If you specify a subsetting IF statement to select observations, only those that meet the IF condition are written.)
      - If an observation contains missing values for a variable, the observation in the output data set contains the missing values as well. Observations with missing values for the BY variable appear at the top of the output data set.
      - If an input data set doesn't have any observations for a given value of the common variable, the observation in the output data set contains missing values for the variables unique to that input data set.
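
The default handling of unmatched observations can be seen in a small sketch. The data sets `demog` and `visit` below are hypothetical work data sets with made-up values, chosen so that each one has an ID the other lacks.

```sas
/* A002 has no visit record */
data demog;
   input ID $ Age;
   datalines;
A001 21
A002 32
A004 44
;
run;

/* A003 has no demographic record */
data visit;
   input ID $ Weight;
   datalines;
A001 150
A003 120
A004 180
;
run;

proc sort data=demog; by ID; run;
proc sort data=visit; by ID; run;

/* No subsetting IF: every BY value from either input is written out */
data combined;
   merge demog visit;
   by ID;
run;

proc print data=combined;
run;
```

The output has four observations. A001 and A004 carry both Age and Weight, A002 has a missing Weight, and A003 has a missing Age.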
  17. 17. SAS Techies 2009 11/13/09
  18. 18. The DATA step provides a large number of other programming features for manipulating data sets during match-merging. For example, you can

      - use IF-THEN/ELSE logic to control processing based on one or more conditions
      - specify additional data set options
      - perform calculations
      - create new variables
      - process variables in arrays
      - use SAS functions
      - use special variables such as FIRST.variable and LAST.variable to control processing.
  19. 19.

          options pageno=1 nodate linesize=80 pagesize=60;

          data testfile;
             set some;
             by Drug Rx;
             if first.Drug then TRx=0;
             TRx+Rx;
             if last.Drug then output;
          run;

      Input data set Some, with the FIRST. and LAST. values for each observation:

          Drug  Rx  FIRST.Drug  FIRST.Rx  LAST.Drug  LAST.Rx
          A     10  1           1         0          1
          A     11  0           1         1          1
          B     11  1           1         0          1
          B     12  0           1         1          1

      Output Testfile:

          Drug  TRx
          A     21
          B     23

      - When an observation is the first in a BY group, SAS sets the value of FIRST.variable to 1 for the variable whose value changed, as well as for all of the variables that follow in the BY statement. For all other observations in the BY group, the value of FIRST.variable is 0.
      - Likewise, if the observation is the last in a BY group, SAS sets the value of LAST.variable to 1 for the variable whose value changes on the next observation, as well as for all of the variables that follow in the BY statement. For all other observations in the BY group, the value of LAST.variable is 0. For the last observation in a data set, the values of all LAST.variable variables are set to 1.
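
The same FIRST./LAST. pattern can count observations per BY group instead of summing a variable. This is a sketch only: it re-enters the four-observation input from the slide with DATALINES, and the output data set `counts` and the variable `NObs` are names invented for the example.

```sas
/* The slide's input data, already in Drug order */
data some;
   input Drug $ Rx;
   datalines;
A 10
A 11
B 11
B 12
;
run;

data counts(keep=Drug NObs);
   set some;
   by Drug;
   if first.Drug then NObs=0;   /* reset the counter at the start of each group */
   NObs+1;                      /* sum statement: NObs is retained across iterations */
   if last.Drug then output;    /* write one observation per Drug */
run;

proc print data=counts;
run;
```

For this input, `counts` holds two observations: Drug A with NObs=2 and Drug B with NObs=2.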
  20. 20.

      - System options apply to data sets and output for the entire session.
      - They can be overridden by data set options.
      - They can be declared anywhere except within DATALINES/CARDS data.
      - Ex: options compress=yes obs=max;
      - Ex:

          options compress=no obs=max;
          data new(compress=yes);
             set cool(obs=10);
          run;

      - Data set options apply to that particular data set only.
      - They CANNOT be overridden by system options.
      - They can be specified only in parentheses after the data set name.
      - Ex:

          data new(compress=yes);
             set cool(obs=10);
          run;
  21. 21.

          GOTO label;

      - The GOTO statement tells SAS to jump immediately to the statement label that is indicated in the GOTO statement and to continue executing statements from that point until a RETURN statement is executed.
      - A RETURN statement after a GO TO statement returns execution to the beginning of the next DATA step iteration.

          LINK label;

      - The LINK statement tells SAS to jump immediately to the statement label that is indicated in the LINK statement and to continue executing statements from that point until a RETURN statement is executed.
      - The RETURN statement sends program control to the statement immediately following the LINK statement.
  22. 22. LINK Statement

          data hydro;
             input type $ depth station $;
             if type='aluv' then link calcu;
             date=today();
             return;
             calcu: if station='site_1' then elevatn=6650-depth;
                    else if station='site_2' then elevatn=5500-depth;
             return;
             datalines;
          aluv 523 site_1
          uppa 234 site_2
          aluv 666 site_2
          ... more data lines ...
          ;

      Goto Statement

          data info;
             input x;
             if 1<=x<=5 then goto add;
             put x=;
             return;
             add: sumx+x;
             return;
             datalines;
          7
          4
          323
          ;
          run;
