Laugh Your Hashes Off: A Look at Multi-Table Hashing
Michael Rainer, Concordance Consulting Group, Minneapolis, MN
ABSTRACT
The crushing weight of big data has forced SAS developers to consider alternative approaches to merging large tables together. One solution to this issue is the hash table technique (see reference 1 for a thorough discussion of hashing). However, in many situations the variables you need reside in many tabular-type datasets, forcing you to join several tables to your base dataset. But this should not deter you from employing the hash technique, because in one DATA step you can hash all your tables simultaneously while making just one pass through your large and wide base dataset. Assuming you have enough RAM on your SAS server to utilize this technique, we have experienced runtime reductions of up to 90%. This paper assumes the reader is familiar with basic hashing techniques and builds upon those tactics in order to extend the power and applicability of hashing.
INTRODUCTION
Appending necessary data is not a problem when your base table's size is manageable, as the basic SORT-SORT-MERGE and PROC SQL join tactics work just fine. But when you have big data the previous options may no longer apply, partially because the SAS server may run out of resources or the process simply takes an unacceptable time to run. Hashing requires a large amount of RAM on your SAS server to hold in memory all the target tables that you want to hash. (To determine how much memory you will need for a specific application, see reference 2 for downloadable SAS code that will help you determine memory requirements.) In the usual application of the hash technique the developer has a very large table and hashes a smaller table in order to append the required data. This paper will first illustrate the problem we faced, the PROC SQL solution we applied, and then the hash solution. We will then break the code apart one section at a time in order to provide enough detail so you can apply this technique to your own situation. The paper concludes in the appendix with a fully executable version of the code along with test data so you can verify that it works and use it as a base for your own situation.
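As a rough stand-in for the downloadable code in reference 2, a minimal sketch like the following uses the hash object's NUM_ITEMS and ITEM_SIZE attributes to estimate a single table's memory footprint (the table and variable names here are the ones used later in this paper):
DATA _NULL_;
   IF 0 THEN SET WORK.acct (KEEP=key1 datawewant1); %*Variable metadata for the PDV;
   DECLARE HASH h (dataset:'WORK.acct');
   h.definekey('key1');
   h.definedata('datawewant1');
   h.definedone(); %*Loads WORK.acct into memory;
   bytes = h.num_items * h.item_size; %*Rough per-table memory estimate in bytes;
   PUT 'Approximate hash memory for WORK.acct: ' bytes comma20.;
   STOP; %*Required: the IF 0 THEN SET never reads data, so stop explicitly;
RUN;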
PROBLEM CONTEXT
We were onsite at a large health insurance company that was performing a PeopleSoft migration. Their data was stored in many different SAS datasets in several libraries. We had a PeopleSoft General Ledger table, but it was missing several variables necessary for the conversion process. Initially I was stunned by the number of joins necessary to append all of the required variables and worried that the usual PROC SQL technique would not suffice. But due to the ease and speed of coding we attempted that first.
A TYPICAL SOLUTION
As concerned as we were about the magnitude of the data, we felt compelled to attempt a fast and easy way of accomplishing our task (pseudocode, not the actual code we used):
PROC SQL;
   CREATE TABLE WORK.test AS
   SELECT a.*
         ,b.datawewant1
         ,c.datawewant2
         ,d.datawewant3
         ,e.datawewant4
         ,f.datawewant5
         ,g.datawewant6
   FROM WORK.base AS a
   LEFT JOIN WORK.acct AS b ON a.key1=b.key1
   LEFT JOIN WORK.bu   AS c ON a.key2=c.key2
   LEFT JOIN WORK.ou   AS d ON a.key3=d.key3
   LEFT JOIN WORK.loc  AS e ON a.key4=e.key4
   LEFT JOIN WORK.prod AS f ON a.key5=f.key5
   LEFT JOIN WORK.dept AS g ON a.key6=g.key6
   ;
QUIT;
Perhaps this plan was too ambitious for the SQL planner to attempt in one step: it ran for 13 hours and then the SAS server ran out of resources and threw an error. We could have broken this step up into two or more queries, but then we would be making multiple passes through the huge base dataset, which is something we wanted to avoid (a sketch of that broken-up style follows this paragraph). Further, even supposing the broken-up style would have worked and prevented the SAS server from running out of resources, we would still be running in hours, not minutes. At this point we were sad -- we needed to laugh -- perhaps laugh so hard that we laughed our hashes off.
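For reference, the broken-up style we decided against might look something like this sketch (two passes, each re-reading a large table):
PROC SQL;
   CREATE TABLE WORK.pass1 AS
   SELECT a.*, b.datawewant1, c.datawewant2, d.datawewant3
   FROM WORK.base AS a
   LEFT JOIN WORK.acct AS b ON a.key1=b.key1
   LEFT JOIN WORK.bu   AS c ON a.key2=c.key2
   LEFT JOIN WORK.ou   AS d ON a.key3=d.key3;

   CREATE TABLE WORK.test AS
   SELECT a.*, e.datawewant4, f.datawewant5, g.datawewant6
   FROM WORK.pass1 AS a
   LEFT JOIN WORK.loc  AS e ON a.key4=e.key4
   LEFT JOIN WORK.prod AS f ON a.key5=f.key5
   LEFT JOIN WORK.dept AS g ON a.key6=g.key6;
QUIT;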
OVERALL STRATEGY BROKEN OUT
As previously mentioned, we wanted to hash all six of our target tables and left join them all onto our large base dataset of 25 million records.
1) INITIALIZE HASH VARIABLES
At the outset of this paper we assumed familiarity with hash tables and that you may have even run a few. All of the following code will be familiar, just a little more of it than one hash table. The first step in the hash process is to initialize the hash variables. As you can see below, we list out our datasets in the IF 0 THEN DO block and then list the variables we want to append (do not forget to list your join key too for each one). The KEEP statement is not required, but I find it helps to organize your thinking about the desired end state if you just list everything out. Plus you will have to anyway when you define your data for each hash.
DATA WORK.test_all;
   IF 0 THEN DO; %*Comment 1;
      SET WORK.acct (KEEP=key1 datawewant1); %*List all your tables here with the key(s) and desired data;
      SET WORK.bu   (KEEP=key2 datawewant2);
      SET WORK.ou   (KEEP=key3 datawewant3);
      SET WORK.loc  (KEEP=key4 datawewant4);
      SET WORK.prod (KEEP=key5 datawewant5);
      SET WORK.dept (KEEP=key6 datawewant6);
   END;
%*Comment 1; The "IF 0 THEN DO" is a way to get the variables and their metadata into the Program Data Vector (PDV) without reading through the entire dataset. "IF 0" is never true, so the data are never actually read; only the compile-time metadata is used, saving precious time.
2) DECLARE THE HASH TABLES
Here we DECLARE each hash table with a name (e.g., "acct", "bu", "ou", etc.) and then indicate the hash object's internal table size with the HASHEXP value (for a complete discussion of the ramifications of higher or lower values, please see reference 3). For our purposes here it will suffice to know that six (actually 2**6) indicates the number of buckets data can be loaded into in the hash table. This number can be experimented with to determine the optimal combination of width and depth of buckets. When in doubt, use a high number like 14 for HASHEXP. In our case, our target tables were not actually that large (just the WORK.base table was), so we used six and that seemed to work well. Lastly, we indicate the names of the datasets, which for purposes of the illustration here are all WORK tables. There are several other options available on the DECLARE HASH statement that are outside the scope of this paper, but realize that here you could control things like duplicate handling in your target tables, output sort order, etc. (a sketch follows Comment 2 below).
IF _N_=1 THEN DO; %*Comment 2;
   DECLARE HASH acct (hashexp:6, dataset:'WORK.acct');
   DECLARE HASH bu   (hashexp:6, dataset:'WORK.bu');
   DECLARE HASH ou   (hashexp:6, dataset:'WORK.ou');
   DECLARE HASH loc  (hashexp:6, dataset:'WORK.loc');
   DECLARE HASH prod (hashexp:6, dataset:'WORK.prod');
   DECLARE HASH dept (hashexp:6, dataset:'WORK.dept'); %*The END for this block is in the next section;
%*Comment 2; IF _N_=1 THEN DO is a way to run this block only once, on the first observation; otherwise we would declare the hashes over and over for every record in the base dataset.
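As an example of those extra options, here is a hedged sketch of two of them applied to our acct table (see the SAS documentation for the full list): MULTIDATA:'Y' keeps duplicate keys in the hash instead of loading only the first row per key, and ORDERED:'A' returns items in ascending key order when the hash is written out.
DECLARE HASH acct (hashexp:6, dataset:'WORK.acct',
                   multidata:'Y', %*Keep duplicate keys rather than only the first row per key;
                   ordered:'A');  %*Ascending key order if the hash is output;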
3) DEFINE YOUR KEYS AND DATA
This section is pretty straightforward. Once the hash knows which dataset we are referring to from the above section, we need to tell it which variable is the key and which is the data. If you have more than one data variable, list them out like this: ou.definedata('data31', 'data32', 'data33');. More than one key would look like this: bu.definekey('key2a', 'key2b');. Note that the initial part of the string indicates the hash table name from above (i.e., acct, bu, ou, loc, prod, dept).
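Put together, a composite-key, multi-data definition would look like this (a sketch with a hypothetical WORK.lookup table; the variable names are illustrative, not tables from our example):
DECLARE HASH h (hashexp:6, dataset:'WORK.lookup'); %*Hypothetical lookup table;
h.definekey('key2a', 'key2b');               %*Composite key: both values must match;
h.definedata('data31', 'data32', 'data33');  %*Three data variables appended per match;
h.definedone();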
%***Acct;
acct.definekey('key1'); %*Notice the dot notation indicating which hash we are referring to;
acct.definedata('datawewant1');
acct.definedone();
%***Bu;
bu.definekey('key2');
bu.definedata('datawewant2');
bu.definedone();
%***Ou;
ou.definekey('key3');
ou.definedata('datawewant3');
ou.definedone();
%***Loc;
loc.definekey('key4');
loc.definedata('datawewant4');
loc.definedone();
%***Prod;
prod.definekey('key5');
prod.definedata('datawewant5');
prod.definedone();
%***Dept;
dept.definekey('key6');
dept.definedata('datawewant6');
dept.definedone();
END; %*End of the only-run-once section;
4) START LOOP OVER YOUR BASE
So now we start the actual loop over every record in the large WORK.base dataset. We will do this until every record has been read. CALL MISSING is required for each appended variable in order to prevent a value from being retained from one record to the next when the next record has no match. This is very important.
DO UNTIL (eof); %*Do until every row has been read from WORK.base;
   SET WORK.base END=eof; %*Our base dataset;
   CALL MISSING(datawewant1, datawewant2, datawewant3, datawewant4, datawewant5, datawewant6);
5) PERFORM ACTUAL LOOKUP
The second half of the DO loop from above performs the actual lookup. The .find() method looks for a match between a key on your base dataset and the key in the hashed dataset of the same name. In this case it is the SQL equivalent of WORK.dataset1 AS A LEFT JOIN WORK.dataset2 AS B ON A.key1=B.key1 (see reference 4 for more on LEFT JOIN hashes). For our requirements we did not want to lose any of our observations, so we OUTPUT unconditionally at the end. If we wanted an INNER JOIN, it would be simple enough to remove the unconditional OUTPUT and instead output only when every lookup returns rc=0 (remember, rc=0 indicates a match in hash lingo, not rc=1); a sketch of that variant follows Comment 3 below. Please note that if you have duplicate keys in your target tables, you will have to code around that if you want to keep those duplicates; otherwise the hash will load only one row per key (see the MULTIDATA sketch in section 2).
rc=acct.find();
rc=bu.find();
rc=ou.find();
rc=loc.find();
rc=prod.find();
rc=dept.find();
_ERROR_=0;%*Comment 3;
OUTPUT;
END; %*Ends the DO UNTIL loop from above;
RUN;
%*Comment 3; If there is no match for a record in any of the six tables we are joining data from, SAS will try to write an error (although in version 9.4 we noticed it did not). We override this by resetting SAS's internal variable _ERROR_ back to 0 in order to prevent an error from being written to the log. Even if an error did print to the log, the code would still be correct and you would obtain the same output, but we believe in clean logs.
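As promised above, here is a minimal sketch of the INNER JOIN variant (assuming the six hashes are declared and defined exactly as in steps 1-3; only base rows that match in all six target tables are written):
DO UNTIL (eof);
   SET WORK.base END=eof;
   CALL MISSING(datawewant1, datawewant2, datawewant3, datawewant4, datawewant5, datawewant6);
   IF acct.find()=0 AND bu.find()=0 AND ou.find()=0 AND
      loc.find()=0 AND prod.find()=0 AND dept.find()=0 THEN OUTPUT; %*Keep only full matches;
   _ERROR_=0; %*Defensive reset, as in the main example;
END;
RUN;
Note that when a method call's return code is consumed in an expression like this, SAS does not write the usual "method failed" message to the log, so the rc= assignments are not needed here.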
CONCLUSION
Initially, when we attempted the PROC SQL technique, it ran for over 13 hours and threw an "Out of Resources" error. It's hard to tell how long it would have run had it actually had enough resources. The hash technique, however, finished in only 29 minutes with our original 25 million rows (no duplicates were added and we lost no records) for a massive reduction in runtime and a massive increase in client happiness. If you have millions of rows and need variables from many different smaller tabular-style tables, then this technique may be right for you. Please see the appendix for functioning code that illustrates the same technique while using just a few observations to clearly illustrate the hash syntax.
ACKNOWLEDGEMENTS
I would like to thank Dwight Buffum for his willingness to review this paper and provide his usual insightful comments.
REFERENCES
1. P. Dorfman: Data Step Hash Objects as Programming Tools. http://www2.sas.com/proceedings/sugi30/236-30.pdf
2. SAS Sample 34193: How to determine how much memory my hash table will require. http://support.sas.com/kb/34/193.html
3. SAS Support: Discussion of how to calculate the table size of your hash tables. https://support.sas.com/documentation/cdl/en/ds2ref/66664/HTML/default/viewer.htm#n03n4ipuouac35n136fdcrccdao9.htm
4. K. Borowiak: A Hash Alternative to the PROC SQL Left Join. http://www.nesug.org/proceedings/nesug06/dm/da07.pdf
CONTACT INFORMATION
Your comments and questions are valued and encouraged, as we are all in this SAS thing together. Contact the author at:
Michael J Rainer, PhD
Concordance Consulting Group, LLC
105 N Elm St
Chaska, MN 55318
Cell: 651.492.3973
Email: michael.rainer@concordancegroup.com
SAS is a registered trademark or trademark of SAS Institute, Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks of their respective companies.
APPENDIX
The attached code is meant to be executable on your system in order to show a proof of concept. I would use this as a base for your own development efforts and edit as needed. One could easily use a DO loop to create millions of rows in order to perform a time test versus the PROC SQL method, if so inclined (a sketch of such a loop follows this paragraph). The samples were kept purposely short in order to simplify testing and prove the joins are working as expected. This appendix code was tested on SAS version 9.4.
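For instance, a hedged sketch of such a DO loop (the row count and MOD-based key values are illustrative choices, not the client data):
DATA WORK.base_big;
   DO i = 1 TO 25000000; %*25 million rows to mimic the problem size;
      key1 = MOD(i, 100) + 1;
      key2 = MOD(i, 100) + 1;
      key3 = MOD(i, 100) + 1;
      key4 = MOD(i, 100) + 1;
      key5 = MOD(i, 100) + 1;
      key6 = MOD(i, 100) + 1;
      OUTPUT;
   END;
   DROP i;
RUN;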
DATA WORK.base;
INPUT (key1-key6) (: 3.);
DATALINES;
1 2 3 4 5 6
11 22 33 44 55 66
77 88 99 111 222 333
;
RUN;
DATA WORK.acct;
INPUT key1 : 3.
datawewant1 : $5.
;
DATALINES;
1 Acct1
11 Acct1
999 Acct1
;
RUN;
DATA WORK.bu;
INPUT key2 : 3.
datawewant2 : $3.
;
DATALINES;
2 Bu1
22 Bu1
999 Bu1
;
RUN;
DATA WORK.ou;
INPUT key3 : 3.
datawewant3 : $3.
;
DATALINES;
3 Ou1
33 Ou1
999 Ou1
;
RUN;
DATA WORK.loc;
INPUT key4 : 3.
datawewant4 : $4.
;
DATALINES;
4 Loc1
44 Loc1
999 Loc1
;
RUN;
DATA WORK.prod;
INPUT key5 : 3.
datawewant5 : $5.
;
DATALINES;
5 Prod1
55 Prod1
999 Prod1
;
RUN;
DATA WORK.dept;
INPUT key6 : 3.
datawewant6 : $5.
;
DATALINES;
6 Dept1
66 Dept1
999 Dept1
;
RUN;
DATA WORK.test_all;
IF 0 THEN DO;%*Comment 1;
SET WORK.acct (KEEP=key1 datawewant1);%*List all your tables here with the key(s) and
desired data;
SET WORK.bu (KEEP=key2 datawewant2);
SET WORK.ou (KEEP=key3 datawewant3);
SET WORK.loc (KEEP=key4 datawewant4);
SET WORK.prod (KEEP=key5 datawewant5);
SET WORK.dept (KEEP=key6 datawewant6);
END;
IF _N_ =1 THEN DO; %*Comment 2;
DECLARE HASH acct (hashexp:6, dataset: 'WORK.acct');
DECLARE HASH bu (hashexp:6, dataset: 'WORK.bu');
DECLARE HASH ou (hashexp:6, dataset: 'WORK.ou');
DECLARE HASH loc (hashexp:6, dataset: 'WORK.loc');
DECLARE HASH prod (hashexp:6, dataset: 'WORK.prod');
DECLARE HASH dept (hashexp:6, dataset: 'WORK.dept');%*The end for this loop is in the next section;
%***Acct;
acct.definekey('key1');%*Notice the dot notation indicating which hash we are referring to;
acct.definedata('datawewant1');
acct.definedone();
%***Bu;
bu.definekey('key2');
bu.definedata('datawewant2');
bu.definedone();
%***Ou;
ou.definekey('key3');
ou.definedata('datawewant3');
ou.definedone();
%***Loc;
loc.definekey('key4');
loc.definedata('datawewant4');
loc.definedone();
%***Prod;
prod.definekey('key5');
prod.definedata('datawewant5');
prod.definedone();
%***Dept;
dept.definekey('key6');
dept.definedata('datawewant6');
dept.definedone();
END;%*End of the only run once section;
DO UNTIL (eof); %*Do until every row has been looped over in WORK.base;
SET WORK.base END=EOF; %*Our base dataset;
CALL MISSING(datawewant1, datawewant2, datawewant3, datawewant4, datawewant5, datawewant6);
rc=acct.find();
rc=bu.find();
rc=ou.find();
rc=loc.find();
rc=prod.find();
rc=dept.find();
_ERROR_=0;%*Comment 3;
OUTPUT;
END; %*Ends the DO UNTIL loop from above;
RUN;