Laugh Your Hashes Off: A Look at Multi-Table Hashing
Michael Rainer, Concordance Consulting Group, Minneapolis, MN
ABSTRACT
The crushing weight of big data has forced SAS developers to consider alternative approaches to merging large tables
together. One solution to this issue is the hash table technique (see reference 1 for a thorough discussion on hashing).
However, in many situations the variables you need reside in many tabular-type datasets, forcing you to join several tables
to your base dataset. But this should not deter you from employing the hash technique, because in one DATA step you can
hash all your tables simultaneously while making just one pass through your large and wide base dataset. Provided you
have enough RAM on your SAS server to use this technique, the payoff can be large: we have experienced runtime
reductions of up to 90%. This paper assumes the reader is familiar with basic hashing techniques and builds upon those
tactics in order to extend the power and applicability of hashing.
INTRODUCTION
Appending necessary data is not a problem when your base table's size is manageable, as the basic SORT-SORT-MERGE and
PROC SQL join tactics work just fine. But when you have big data the previous options may no longer apply, partially
because the SAS server may run out of resources or the process simply takes an unacceptable time to run. Hashing
requires a large amount of RAM on your SAS server to hold in memory all the target tables that you want to hash. (To
determine how much memory you will need for a specific application, see reference #2 for downloadable SAS code that
will help you determine memory requirements.) In the usual application of the hash technique the developer has a very
large table and hashes a smaller table in order to append the required data. This paper will first illustrate the problem we
faced, the PROC SQL solution we applied, and then the hash solution. We will then break the code apart one section at a
time in order to provide enough detail so you can apply this technique to your own situation. The paper concludes in the
appendix with a fully executable version of the code along with test data so you can verify that it works and use it as a base
for your own situation.
PROBLEM CONTEXT
We were onsite at a large health insurance company that was performing a PeopleSoft migration. Their data was stored in
many different SAS datasets in several libraries. We had a PeopleSoft General Ledger table but it was missing several
variables necessary for the conversion process. Initially I was stunned by the number of joins necessary to append all of
the required variables and worried that the usual PROC SQL technique would not suffice. But due to the ease and speed of
coding we first attempted that.
A TYPICAL SOLUTION
As concerned as we were about the magnitude of the data, we felt compelled to attempt a fast and easy way of
accomplishing our task (pseudocode, not the actual code we used):
PROC SQL;
CREATE TABLE WORK.test AS
SELECT A.*
,B.datawewant1
,C.datawewant2
,D.datawewant3
,E.datawewant4
,F.datawewant5
,G.datawewant6
FROM WORK.base AS a
LEFT JOIN WORK.acct AS b ON A.key1=B.key1
LEFT JOIN WORK.bu AS c ON A.key2=C.key2
LEFT JOIN WORK.ou AS d ON A.key3=D.key3
LEFT JOIN WORK.loc AS e ON A.key4=E.key4
LEFT JOIN WORK.prod AS f ON A.key5=F.key5
LEFT JOIN WORK.dept AS g ON A.key6=G.key6
;QUIT;
Perhaps this plan was too ambitious for the SQL planner to try in one step, as it ran for 13 hours and then the SAS server
ran out of resources and threw an error. We could have broken this step up into two or more, but then we would be making
multiple passes through the huge base dataset, which is something we wanted to avoid. Further, even if we argue that this
broken-up style would have worked and prevented the SAS server from running out of resources, we would still be running
in hours, not minutes. At this point we were sad -- we needed to laugh -- perhaps laugh so hard that we laughed our hashes off.
OVERALL STRATEGY BROKEN OUT
As previously mentioned, we wanted to hash all six of our target tables and left join them all onto our large base dataset of
25 million records.
1) INITIALIZE HASH VARIABLES
At the outset of this paper we assumed familiarity with hash tables and that you may have even run a few. All of the
following code will be familiar; there is just a little more of it than with one hash table. The first step in the hash process
is to initialize the hash variables. As you can see below, we list out our datasets in the IF 0 THEN DO block and then list
the variables we want to append (do not forget to list your join key too for each one). The KEEP= option is not required
but I find it helps to organize your thinking about the desired end state if you just list everything out. Plus you will have
to anyway when you define your data for each hash.
DATA WORK.test_all;
IF 0 THEN DO;%*Comment1;
SET WORK.acct (KEEP=key1 datawewant1);%*List all your tables here with the key(s) and desired data;
SET WORK.bu (KEEP=key2 datawewant2);
SET WORK.ou (KEEP=key3 datawewant3);
SET WORK.loc (KEEP=key4 datawewant4);
SET WORK.prod (KEEP=key5 datawewant5);
SET WORK.dept (KEEP=key6 datawewant6);
END;
%*Comment1; The “IF 0 THEN DO” is a way to get the variables and their metadata into the PDV, and so make them
available to the hash, without running through the entire dataset. “IF 0” is never true, so no records are read at
execution time; SAS only reads each dataset's descriptor at compile time, saving precious time.
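The effect of the never-executed SET can be seen in isolation with a small sketch. The dataset names WORK.demo_lookup and WORK.demo_shell are hypothetical and exist only for this illustration; they are not part of the code we ran for the client.

```sas
/* Hypothetical lookup table, used only for this illustration */
DATA WORK.demo_lookup;
  INPUT key1 datawewant1 : $5.;
  DATALINES;
1 Acct1
2 Acct2
;
RUN;

DATA WORK.demo_shell;
  /* Compiled but never executed: the variables enter the PDV, no rows are read */
  IF 0 THEN SET WORK.demo_lookup (KEEP=key1 datawewant1);
  OUTPUT;  /* writes one row in which key1 and datawewant1 are both missing */
  STOP;    /* stop explicitly because no SET statement ever executes */
RUN;
```

Running PROC CONTENTS on WORK.demo_shell should show both variables with the type and length they have in the lookup table, even though none of its rows were read.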
2) DECLARE THE HASH TABLES
Here we DECLARE the hash table with a name (e.g., “acct”, “bu”, “ou”, etc.) and then indicate the hash package's internal
table size with the hashexp number (for a complete discussion of the ramifications of higher or lower values please see
reference #3). For our purposes here it will suffice to know that six (actually 2**6 = 64) indicates the number of buckets
into which data can be loaded in the hash table. This number can be experimented with to determine the optimal
combination of width and depth of buckets. When in doubt use a high number like 14 for hashexp. In our case, our target
tables were not actually that large (just the WORK.base table was) so we used six and that seemed to work well. Lastly we
indicate the names of the datasets, which for purposes of the illustration here are all work tables. There are several other
options available on the DECLARE HASH statement that are outside the scope of this paper, but realize that here you could
indicate things like duplicates in your target tables, output sort order, etc.
IF _N_ =1 THEN DO; %*Comment 2;
DECLARE HASH acct (hashexp:6, dataset: 'WORK.acct');
DECLARE HASH bu (hashexp:6, dataset: 'WORK.bu');
DECLARE HASH ou (hashexp:6, dataset: 'WORK.ou');
DECLARE HASH loc (hashexp:6, dataset: 'WORK.loc');
DECLARE HASH prod (hashexp:6, dataset: 'WORK.prod');
DECLARE HASH dept (hashexp:6, dataset: 'WORK.dept');%*The end for this loop is in the next section;
%*Comment2; “IF _N_=1 THEN DO” runs the enclosed statements only on the first iteration of the DATA step, so each
hash is declared once; otherwise the hash would be declared again on every iteration of the DATA step.
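As a sketch of one of the additional DECLARE HASH options mentioned above, the example below uses the multidata argument to keep duplicate keys and the FIND_NEXT method to walk through them. The tables WORK.demo_acct, WORK.demo_base and WORK.demo_multi are hypothetical test data, and this variation was not part of our client solution, where the target tables had no duplicates we needed to keep.

```sas
/* Hypothetical target table with a duplicated key (key1=1) */
DATA WORK.demo_acct;
  INPUT key1 datawewant1 : $5.;
  DATALINES;
1 AcctA
1 AcctB
2 AcctC
;
RUN;

/* Hypothetical base table */
DATA WORK.demo_base;
  INPUT key1 amount;
  DATALINES;
1 100
2 200
3 300
;
RUN;

DATA WORK.demo_multi;
  IF 0 THEN SET WORK.demo_acct (KEEP=key1 datawewant1);
  IF _N_ = 1 THEN DO;
    /* multidata:'Y' loads every row for a repeated key instead of only the first */
    DECLARE HASH acct (hashexp:6, dataset:'WORK.demo_acct', multidata:'Y');
    acct.definekey('key1');
    acct.definedata('datawewant1');
    acct.definedone();
  END;
  SET WORK.demo_base;
  CALL MISSING(datawewant1);
  rc = acct.find();
  IF rc NE 0 THEN OUTPUT;   /* no match: keep the base row, as a left join would */
  DO WHILE (rc = 0);        /* one output row per matching hash row */
    OUTPUT;
    rc = acct.find_next();
  END;
  DROP rc;
RUN;
```

With this test data the result should have four rows: two for key1=1, one for key1=2, and one unmatched row for key1=3. Note that keeping duplicates means the output can have more rows than the base table, just as a SQL left join would.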
3) DEFINE YOUR KEYS AND DATA
This section is pretty straightforward. Once the hash knows which dataset we are referring to from the above section,
we need to tell it which is the key and which is the data. If you have more than one data variable, list them out like
this: ou.definedata('data31','data32','data33');. More than one key would look like this: bu.definekey('key2a',
'key2b');. Note that the initial part of the statement indicates the hash table name from above (i.e., acct, bu, ou, loc, prod,
dept).
%***Acct;
acct.definekey('key1');%*Notice the dot notation indicating which hash we are referring to;
acct.definedata('datawewant1');
acct.definedone();
%***Bu;
bu.definekey('key2');
bu.definedata('datawewant2');
bu.definedone();
%***Ou;
ou.definekey('key3');
ou.definedata('datawewant3');
ou.definedone();
%***Loc;
loc.definekey('key4');
loc.definedata('datawewant4');
loc.definedone();
%***Prod;
prod.definekey('key5');
prod.definedata('datawewant5');
prod.definedone();
%***Dept;
dept.definekey('key6');
dept.definedata('datawewant6');
dept.definedone();
END;%*End of the only run once section;
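The multi-key and multi-data forms described in this section can be sketched as a complete step. The tables WORK.demo_bu, WORK.demo_txn and WORK.demo_joined and the variables key2a, key2b, bu_name and bu_region are hypothetical names chosen for this illustration only.

```sas
/* Hypothetical target table keyed on two variables */
DATA WORK.demo_bu;
  INPUT key2a key2b : $2. bu_name : $5. bu_region : $5.;
  DATALINES;
10 MN BuOne North
10 WI BuTwo South
;
RUN;

/* Hypothetical base table */
DATA WORK.demo_txn;
  INPUT key2a key2b : $2. amount;
  DATALINES;
10 MN 100
10 WI 200
20 MN 300
;
RUN;

DATA WORK.demo_joined;
  IF 0 THEN SET WORK.demo_bu (KEEP=key2a key2b bu_name bu_region);
  IF _N_ = 1 THEN DO;
    DECLARE HASH bu (hashexp:6, dataset:'WORK.demo_bu');
    bu.definekey('key2a', 'key2b');          /* two-part key */
    bu.definedata('bu_name', 'bu_region');   /* two data variables */
    bu.definedone();
  END;
  SET WORK.demo_txn;
  CALL MISSING(bu_name, bu_region);
  rc = bu.find();   /* both key parts must match for the lookup to succeed */
  DROP rc;
RUN;
```

The third base row (20, MN) matches on key2b alone, so its lookup fails and bu_name and bu_region stay missing.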
4) START LOOP OVER YOUR BASE
So now we start the actual loop over every record in the large WORK.base dataset. We will do this until every record has
been looped over. CALL MISSING is required for each data variable that the hashes supply in order to prevent the value from
being retained from one record to the next if the next record has no match. This is very important.
DO UNTIL (eof);%*Loop until every row in WORK.base has been read;
SET WORK.base END=eof;%*Our base dataset;
CALL MISSING(datawewant1,datawewant2,datawewant3,datawewant4,datawewant5,datawewant6);
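To see why CALL MISSING matters, the following standalone sketch runs the same lookup twice, once without it and once with it. It is separate from the DATA step being built in this paper, and the dataset names (WORK.demo_loc, WORK.demo_rows, WORK.demo_carry, WORK.demo_clean) are hypothetical.

```sas
/* Hypothetical target table with a single key */
DATA WORK.demo_loc;
  INPUT key4 datawewant4 : $5.;
  DATALINES;
4 Loc1
;
RUN;

/* Hypothetical base table: the second row has no match */
DATA WORK.demo_rows;
  INPUT key4;
  DATALINES;
4
9
;
RUN;

/* Without CALL MISSING: the unmatched row (key4=9) carries over 'Loc1' */
DATA WORK.demo_carry;
  IF 0 THEN SET WORK.demo_loc (KEEP=key4 datawewant4);
  IF _N_ = 1 THEN DO;
    DECLARE HASH loc (dataset:'WORK.demo_loc');
    loc.definekey('key4');
    loc.definedata('datawewant4');
    loc.definedone();
  END;
  SET WORK.demo_rows;
  rc = loc.find();
  DROP rc;
RUN;

/* With CALL MISSING: the unmatched row correctly has a missing value */
DATA WORK.demo_clean;
  IF 0 THEN SET WORK.demo_loc (KEEP=key4 datawewant4);
  IF _N_ = 1 THEN DO;
    DECLARE HASH loc (dataset:'WORK.demo_loc');
    loc.definekey('key4');
    loc.definedata('datawewant4');
    loc.definedone();
  END;
  SET WORK.demo_rows;
  CALL MISSING(datawewant4);
  rc = loc.find();
  DROP rc;
RUN;
```

The carry-over happens because variables that come from a SET statement are retained, and a failed find() leaves the data variable untouched.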
5) PERFORM ACTUAL LOOKUP
The second half of the DO loop from above performs the actual lookup. The .find() method looks for a match
between a key on your base dataset and the key of the same name in the hashed dataset. In this case it is the
equivalent of the SQL WORK.dataset1 AS A LEFT JOIN WORK.dataset2 AS B ON A.key1=B.key1 (see reference 4 for
more on LEFT JOIN hashes). For our requirements we did not want to lose any of our observations so we OUTPUT
unconditionally at the end. If we wanted an INNER JOIN it would be simple enough to remove the OUTPUT line
entirely and output conditionally instead, e.g. “IF rc=0 THEN OUTPUT;” for a single lookup (remember rc=0
indicates a match in hash lingo, not rc=1); that would output only the matches between the base and the target
dataset. With several lookups, check every return code before a single OUTPUT so that a record is not written
more than once. Please note that if you have duplicates in your target tables you will have to code around that
if you want to keep those duplicates; otherwise the hash process will not load any duplicates into the hash.
rc=acct.find();
rc=bu.find();
rc=ou.find();
rc=loc.find();
rc=prod.find();
rc=dept.find();
_ERROR_=0;%*Comment 3;
OUTPUT;
END; %*Endsthe DO UNTIL loopfromabove;
RUN;
%*Comment3; If there is no match for a record on any of the six tables we are joining data from, SAS will try to throw an
error (although in version 9.4 we noticed it did not throw an error). We override this by resetting SAS's automatic variable
_ERROR_ back to 0 in order to prevent an error being written to the log. If it did print an error to the log the code would
still be correct and you would obtain the same output, but we believe in clean logs.
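For comparison, an INNER JOIN variant with more than one lookup might look like the sketch below. It is an assumption-laden illustration rather than code we ran at the client: the tables WORK.demo_a, WORK.demo_b, WORK.demo_in and WORK.demo_inner are hypothetical, and separate return codes rc1 and rc2 are used so that each lookup can be tested before one conditional OUTPUT.

```sas
/* Hypothetical target tables */
DATA WORK.demo_a;
  INPUT key1 datawewant1 : $5.;
  DATALINES;
1 Acct1
2 Acct2
;
RUN;

DATA WORK.demo_b;
  INPUT key2 datawewant2 : $5.;
  DATALINES;
1 Bu1
;
RUN;

/* Hypothetical base table: only the first row matches both targets */
DATA WORK.demo_in;
  INPUT key1 key2;
  DATALINES;
1 1
2 9
;
RUN;

DATA WORK.demo_inner;
  IF 0 THEN DO;
    SET WORK.demo_a (KEEP=key1 datawewant1);
    SET WORK.demo_b (KEEP=key2 datawewant2);
  END;
  IF _N_ = 1 THEN DO;
    DECLARE HASH acct (dataset:'WORK.demo_a');
    acct.definekey('key1');
    acct.definedata('datawewant1');
    acct.definedone();
    DECLARE HASH bu (dataset:'WORK.demo_b');
    bu.definekey('key2');
    bu.definedata('datawewant2');
    bu.definedone();
  END;
  DO UNTIL (eof);
    SET WORK.demo_in END=eof;
    CALL MISSING(datawewant1, datawewant2);
    rc1 = acct.find();
    rc2 = bu.find();
    /* One conditional OUTPUT: the row is kept only if every lookup matched */
    IF rc1 = 0 AND rc2 = 0 THEN OUTPUT;
  END;
  DROP rc1 rc2;
RUN;
```

Only the base row with key1=1 and key2=1 should be written, because the second row has no match in WORK.demo_b.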
CONCLUSION
Initially when we attempted the PROC SQL technique it ran for over 13 hours and threw an “Out of Resources” error. It's
hard to tell how long it would have run had it actually had enough resources. The hash technique, however, finished in only
29 minutes with our original 25 million rows (no duplicates were added and we lost no records) for a massive reduction in
runtime and a massive increase in client happiness. If you have millions of rows and need variables from many different
smaller tabular-style tables then this technique may be right for you. Please see the appendix for functioning code that
illustrates the same technique while using just a few observations to clearly illustrate the hash syntax.
ACKNOWLEDGEMENTS
I would like to thank Dwight Buffum for his willingness to review this paper and provide his usual insightful comments.
REFERENCES
1. P. Dorfman: Data Step Hash Objects as Programming Tools
http://www2.sas.com/proceedings/sugi30/236-30.pdf
2. Sample 34193: How to determine how much memory my hash table will require
http://support.sas.com/kb/34/193.html
3. SAS Support: Discussion of how to calculate the table size of your hash tables
https://support.sas.com/documentation/cdl/en/ds2ref/66664/HTML/default/viewer.htm#n03n4ipuouac35n136fdcrccdao9.htm
4. K. Borowiak: A Hash Alternative to the PROC SQL Left Join
http://www.nesug.org/proceedings/nesug06/dm/da07.pdf
CONTACT INFORMATION
Your comments and questions are valued and encouraged as we are all in this SAS thing together. Contact the author at:
Michael J Rainer, PhD
Concordance Consulting Group, LLC
105 N Elm St
Chaska MN 55318
Cell: 651.492.3973
Email: michael.rainer@concordancegroup.com
SAS is a registered trademark or trademark of SAS Institute, Inc. in the USA and other countries. ® indicates USA registration. Other
brand and product names are registered trademarks of their respective companies.
APPENDIX
The attached code is meant to be executable on your system in order to show a proof of concept. I would use this as a
base for your own development efforts and edit as needed. One could easily use a DO loop and create millions of rows in
DATA WORK.prod;
INPUT key5 : 3.
datawewant5 : $5.
;
DATALINES;
5 Prod1
55 Prod1
999 Prod1
;
RUN;
DATA WORK.dept;
INPUT key6 : 3.
datawewant6 : $5.
;
DATALINES;
6 Dept1
66 Dept1
999 Dept1
;
RUN;
DATA WORK.test_all;
IF 0 THEN DO;%*Comment 1;
SET WORK.acct (KEEP=key1 datawewant1);%*List all your tables here with the key(s) and desired data;
SET WORK.bu (KEEP=key2 datawewant2);
SET WORK.ou (KEEP=key3 datawewant3);
SET WORK.loc (KEEP=key4 datawewant4);
SET WORK.prod (KEEP=key5 datawewant5);
SET WORK.dept (KEEP=key6 datawewant6);
END;
IF _N_ =1 THEN DO; %*Comment 2;
DECLARE HASH acct (hashexp:6, dataset: 'WORK.acct');
DECLARE HASH bu (hashexp:6, dataset: 'WORK.bu');
DECLARE HASH ou (hashexp:6, dataset: 'WORK.ou');
DECLARE HASH loc (hashexp:6, dataset: 'WORK.loc');
DECLARE HASH prod (hashexp:6, dataset: 'WORK.prod');
DECLARE HASH dept (hashexp:6, dataset: 'WORK.dept');%*The end for this loop is in the next section;
%***Acct;
acct.definekey('key1');%*Notice the dot notation indicating which hash we are referring to;
acct.definedata('datawewant1');
acct.definedone();
%***Bu;
bu.definekey('key2');
bu.definedata('datawewant2');
bu.definedone();
%***Ou;
ou.definekey('key3');
ou.definedata('datawewant3');
ou.definedone();
%***Loc;
loc.definekey('key4');