PROC SQL - What Does It Offer Traditional SAS® Programming?                                          Ian Whitlock, Westat ...
It is easy to see how to produce the report with aIn all the remaining example code the PROC                   DATA _NULL_...
where = and                                                 ( select distinct school          date is between be...
means that the usefulness of SQL is highly                  In the section on the Cartesian product, wedependent on how we...
Macro - SQL Interaction                                          2.    Find the minimum of all MINGROUP                   ...
%local i done ;                                                     ;proc sql ;                                           ...
Upcoming SlideShare
Loading in...5

Proc sql tips


Published on

  • Be the first to comment

  • Be the first to like this

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Proc sql tips

  1. 1. PROC SQL - What Does It Offer Traditional SAS® Programming? Ian Whitlock, Westat Inc.Abstract Matching multiple data sets at differentPROC SQL plays two important roles in SAS: levels 1. To connect SAS with other data base PROC SQL provides a powerful tool when extracting systems data from different data sets at several different levels. It not only provides simpler code, it provides 2. To help the typical SAS data processing a new way of looking at these problems. programmer Suppose we have data at the state, county and cityThis paper looks at the second role. SAS program- level stored in three data sets.ming problems are presented, where PROC SQL hasa distinct advantage over SAS programs that do not State (state, region,...)use this procedure. By focusing on such problems county (cntyid, state, cnty, area,with coded solutions I hope to answer the question, ...)"When is it most appropriate to consider using PROC city (city, cntyid, area, pop,...)SQL in a SAS program?" Prepare a report of all cities in the midwest with populations over 100,000 with the ratio of the cityIntroduction area to the enclosing county area.The typical SAS programmer needs PROC SQL A SAS procedural solution demands that we decidebecause: whether to start with states or cities, specify all sorts needed for the various DATA step merges, and 1. It is superb at accessing data stored in specify those merges in detail ending up with a multiple data sets at different levels. PROC PRINT. 2. It can easily produce a Cartesian product. In contrast, SQL asks the fundamental questions: 3. It can perform matching where the condition of a match is not equality. 1. What are the data sets? 2. What are the subsetting conditions? 4. It is good at summarization. 3. What are the linking conditions? 5. With the introduction of 6.11, it can make 4. What columns should appear? arrays of macro variables or do away with the need for these arrays by assigning a proc sql ; whole column of values to one macro select st.state , variable. cn.cnty , , 6. Macro - SQL interaction enhances both ct.area / cn.area as arearato macro and SQL. from state as st , county as cn ,In addition to the direct values listed above, one city as ctshould not underestimate the value of SQL training in where ct.pop > 100000 andteaching one data organization. An example of how st.region = MW andSQL can teach data organization is given at the end st.state = cn.state andof the summarizing section. cn.cntyid = ct.cntyid ; quit ; 1
  2. 2. It is easy to see how to produce the report with aIn all the remaining example code the PROC DATA _NULL_ step when the right information is in astatement and the QUIT will be omitted. data set ( VARIABLE, FORMAT, VALUE, LABEL, and COUNT ). Here is the SQL code to produce the file.Cartesian Product create table report asCartesian product matches are far more common selectthan one-to-one matches, but the MERGE statement coalesce (sfm.variable, fq.variable)assumes one-to-one within BY-groups. To find a as variable ,Cartesian product match, lets look at a codebook coalesce ( sfm.format , fq.format )example. I have three data sets: as format , coalesce ( sfm.value , fq.value ) specs ( variable , format ) as value , freq ( variable, format, value, count ) sfm.label , fmts ( format, value, label ) coalesce ( fq.count , 0 ) as countThe report might look something like this. from sfm full join freq as fq on sfm.variable = fq.variable and first variable using fmt1name format sfm.format = fq.format and sfm.value = fq.value value label count order by variable , format , value 1 first 500 ; 2 sec 0 3 rem 300 Note that in two SQL statements we have done a lot of the work toward creating a codebook. If one could second variable using fmt2name format produce SPECS, FREQ, and FMTS easily, then one could produce a codebook for any properly formatted etc. SAS data set. The FMTS file is trivially produced with the FMTLIB option of PROC FORMAT. TheBefore tackling the problem lets look at the code for FREQ file requires some macro code. We willjoining SPECS and FMTS. The problem involves a postpone discussion of the SPECS file to a laterCartesian product because a format may be section.associated with more than one variable in SPECS,and formats typically have more than one value.Thus FORMAT does not determine a single record in Fuzzy Matchingeither data set; hence a merge by FORMAT will notwork. Note that accomplishing this sort of combining Fuzzy matching comes in two varieties. In date (orrecords in a DATA step involves using sophisticated time) line matches one file holds a specific date (orSAS techniques when SQL is not used. time) and one wants the corresponding record which holds a range of dates (or time). For SAS dates create view sfm as DATE, BEGDATE, and ENDDATE the WHERE select s.variable , clause might be s.format , fm.value , where date is fm.label between begdate and enddate from specs as s , fmts as fm where s.format = fm.format For efficiency reasons it is important to add an equi- ; condition whenever possible. In date (or time) matches one often has an ID that must also match, hence the equi-join condition becomes 2
  3. 3. where = and ( select distinct school date is between begdate and enddate from studsamp group by schoolIn the other kind of fuzzy matching one cannot trust having wght/sum(wght) > .2the identifying variables. Suppose we want to match ) as wanton social security numbers, SSN, but expect where = want.schooltransposition errors and single digit mutations. Now order by, stu.wght descthe WHERE clause might be ; where sum ( substr(a.SSN,1,1) = The technique is important because there are many substr(b.SSN,1,1) , times one wants to view every one in a group if substr(a.SSN,2,1) = anyone in the group has some property. SQL substr(b.SSN,2,1) , provides a natural idiom for producing the report. .... substr(a.SSN,9,1) = Knowing SQL should make one more sensitive to substr(b.SSN,9,1) bad patterns of storing data. For example, a ) >= 7 common question on SAS-L is how to array data. Given the dataTo make this an equi-join we might add ID DATE COUNT and substr(,1,3) = 1 5jun1993 50 substr(,1,3) 1 16oct1993 25 1 21dec1993 8or some other relatively safe blocking variable. 2 14may1990 16 2 27jan1991 3Summarizing how do you produce one record per ID with as many date and count fields as needed, say ID, DATE1 -One PROC SQL step can do the job of a PROC DATE32 and COUNT1 - COUNT32? AnotherSUMMARY followed by merging of the results with common question is how to work with the arrayedthe original data. For example, suppose we have a data. For example, how can you compute the rate ofweighted student sample including many different decrease in count per month and per year for eachschools. We want the percentage weight of each ID. The answer is a trivial SQL problem, when thestudent in a school. Then we might have: data are stored as they were originally given. select school , student , wght , select id , 100*wght/sum(wght) as pctwt (max(count)-min(count))/ from studsamp intck(month,min(date),max(date)) group by school as decpmon, ; calculated decpmon * 12 as decpyr from origdataIn this case one gets a message that summary data group by idwas remerged with the original data, but that is ;precisely what we wanted. After arraying it becomes a harder problem. PerhapsNow suppose we want to look at all the students from if SAS programmers learned SQL, and how to solveany school which has some student contributing problems without arrays, then they would also learnmore than 20% of the weight. The code might be the advantages of storing data in a non-arrayed form. With SQL training, one comes to realize the select stu.* importance of putting the information into the data from studsamp as stu , instead of the variable names. Of course, this also 3
  4. 4. means that the usefulness of SQL is highly In the section on the Cartesian product, wedependent on how well the data are stored, but it postponed discussion of the data set SPECS. Itwould be wrong to conclude that one might as well could be generated from one of the "dictionary" filesavoid learning SQL because of bad data documented in the Technical Report practices. Suppose we are interested in making a codebook for the data set LIB.MYDATA, then the following code could generate SPECS.Macro Lists Via PROC SQL create specs asPROC SQLs ability to assign a whole column of select name as variable ,values to a macro variable has drastically changed casehow one writes macro code. Consider the splitting when format= and type=charproblem. Given a data set ALL with a variable SPLIT then $charnaming a member, split ALL by the variable SPLIT. when format= and type= numBefore version 6.11 one had to use CALL SYMPUT then bestto create an array of data set names and values and else formatthen write a monster SELECT statement. The whole end as formatthing had to be in a macro in order to repetitively from dictionary.columnsprocess the array. Now one might view it as a where libname = LIB andproblem to produce two lists memname = MYDATA ; 1. The names of data sets 2. WHEN / OUTPUT statements for a To prepare for doing the frequencies needed to make SELECT block the data set FREQ we could use the array form of generating variables from a column.The first case is easily handled by select variable , select distinct lib.||split format , into :datalist separated by into :var1 - var9999 , from all ; :fmt1 - fmt9999 from specsThe second is more of the same, only harder. ; %let nvar = &sqlobs ; select distinct when (||split||) output lib.|| The frequency data sets can then be generated in a split into :whenlist PROC FORMAT with the macro code separated by ; from all proc freq data = lib.mydata ; ; %do i = 1 %to &nvar ; table &&var&i / out=&&var&i ;Now the code to produce the split is trivial and need format &&var&i &&fmt&i... ;not even be housed in a macro. %end ; run ; data &datalist ; set all ; We still have not combined the frequency data sets select ( split ) ; into one data set, but that task can be left to a &whenlist ; competent macro programmer, even one who otherwise ; doesnt know SQL (assuming that that is not a end ; contradiction in terms). run ; 4
  5. 5. Macro - SQL Interaction 2. Find the minimum of all MINGROUP values for all names in a common groupThe previous section showed how the making of lists (e.g. GROUP = 2 has MINGROUP = 1).has had a dramatic effect on the way one codesmacro problems involving lists. Now we consider a PROC SQL is very suitable to both types ofmore complex interaction between PROC SQL and minimization. In the first case we might havemacro, where macro code is used to write the SQLcode in a loop and the whole problem is much easier, create table t asprecisely because it is SQL code. select name, group, min (group) as mingroupSuppose we have a data set, W, containing the from datasetvariables NAME and GROUP. group by name ; NAME GROUP In the second case we might have A 1 create table t as B 1 select name, group, B 2 min (mingroup) as mingroup C 2 from t D 2 group by group ; D 3 E 3 F 4 These two operations must be repeated over and G 4 over until no new minimums are found, since each G 5 new extension of a group may mean further H 5 collapsing. To express the iteration of this code to an arbitrary level, we need a macro %DO-loop. This time we will present the complete macro,We want to collapse groups to the lowest level. For %GROUPIT.example, since A and B belong to group 1, and B andC belong to group 2, then all members of group 2 are For generality, we make parameters to name thepart of group 1 because the groups have the input and output data sets, and the variablescommon member B. Once this is seen one can add represented by NAME, GROUP, and 3 to the new group 1 because of the common The parameter MAX is added to insure that themember D. Thus group 1 covers A, B, C, D and E. macro does not execute for an excessively long time.Similarly F, G, and H ultimately belong to group 4. (Since the algorithm does converge one could doMore formally, two groups are in the same chain if away with this parameter or set it to the number ofthere is a sequence of groups containing the given observations.)groups such that each consecutive pair of groups %macro groupitcontains a common name. Using this definition the ( data=&syslast, /* input data */data set consists of disjoint chains. The problem is out=_DATA_, /* resulting data */to write a program identifying each chain by the name=name, /* name variable */minimum group number in the chain. group=group, /* group variable */ mingroup=mingroup, /* minimum */The intuitive argument given in the previous max=20 /* limit #iterations */paragraph uses two kinds of minimization. ) ; 1. Find the minimum group (call it /* ------------------------------------ MINGROUP) for all names having the minimize group on name and then group same value (e.g. NAME = B has repeat until max iterations or done MINGROUP = 1). ------------------------------------ */ 5
  6. 6. %local i done ; ;proc sql ; %let done = /* ------------------------------ %eval ( not &sqlobs ) ; initial set up - get first reset print ; minimums, start numbered drop table __t%eval(&i-1) ; sequence of data sets %end ; /* end iterative loop */ ------------------------------ */ %if not &done %then create table __t0 as %put WARNING(GROUPIT):Process select &name , &group , stopped by condition MAX=&max; min (&group) as &mingroup %else from &data %do ; group by &name ; create table &out as select &name , &group , create table __t0 as &mingroup select &name , &group , from __t&i min (&mingroup) as &mingroup order by &name , &group from __t0 ; group by &group ; drop table __t&i ; %end ; /* ------------------------------ iterate until done or too many quit ; iterations %mend groupit ; ------------------------------ */ %groupit ( data = w , name = name1 , %do %until (&done or &i > &max) ; group = group1 , out = w2 ) %let i = %eval ( &i + 1 ) ; proc print data = w2 ; create table __t&i as run ; select &name , &group , min (&mingroup) as &mingroup from __t%eval(&i-1) Conclusion group by &name ; I have pointed out six areas where SQL code excels. create table __t&i as My conclusion is that a good SAS programmer can select &name , &group , no longer ignore PROC SQL and remain good. min (&mingroup) as &mingroup from __t&i The author can be contacted by mail at: group by &group ; Westat Inc. 1650 Research Boulevard /* are we finished? */ Rockville, MD 20850-3129 reset noprint ; select w1.&name or by e-mail at: from __t%eval(&i-1) as w1, __t&i as w2 where w1.&name=w2.&name and &group=w2.&group and SAS is a registered trademark or trademark of SAS w1.&mingroup ^= Institute Inc. in the USA and other countries. ® w2.&mingroup indicates USA registration 6