Implementation of Coalesced Hashing



Programming Techniques and Data Structures
Ellis Horowitz, Editor

Implementations for Coalesced Hashing

Jeffrey Scott Vitter
Brown University

The coalesced hashing method is one of the faster searching methods known today. This paper is a practical study of coalesced hashing for use by those who intend to implement or further study the algorithm. Techniques are developed for tuning an important parameter that relates the sizes of the address region and the cellar in order to optimize the average running times of different implementations. A value for the parameter is reported that works well in most cases. Detailed graphs explain how the parameter can be tuned further to meet specific needs. The resulting tuned algorithm outperforms several well-known methods including standard coalesced hashing, separate (or direct) chaining, linear probing, and double hashing. A variety of related methods are also analyzed including deletion algorithms, a new and improved insertion strategy called varied-insertion, and applications to external searching on secondary storage devices.

CR Categories and Subject Descriptors: D.2.8 [Software Engineering]: Metrics--performance measures; E.2 [Data]: Data Storage Representations--hash-table representations; F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems--sorting and searching; H.2.2 [Database Management]: Physical Design--access methods; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval--search process

General Terms: Algorithms, Design, Performance, Theory

Additional Key Words and Phrases: analysis of algorithms, coalesced hashing, hashing, data structures, databases, deletion, asymptotic analysis, average-case, optimization, secondary storage, assembly language

This research was supported in part by a National Science Foundation fellowship and by National Science Foundation grants MCS-77-23738 and MCS-81-05324.
Author's Present Address: Jeffrey Scott Vitter, Department of Computer Science, Box 1910, Brown University, Providence, RI 02912.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1982 ACM 0001-0782/82/1200-0911 $00.75.
Communications of the ACM, December 1982, Volume 25, Number 12.

1. Introduction

One of the primary uses today for computer technology is information storage and retrieval. Typical searching applications include dictionaries, telephone listings, medical databases, symbol tables for compilers, and storing a company's business records. Each package of information is stored in computer memory as a record. We assume there is a special field in each record, called the key, that uniquely identifies it. The job of a searching algorithm is to take an input $K$ and return the record (if any) that has $K$ as its key.

Hashing is a widely used searching technique because no matter how many records are stored, the average search times remain bounded. The common element of all hashing algorithms is a predefined and quickly computed hash function

$$\mathrm{hash}\colon \{\text{all possible keys}\} \to \{1, 2, \ldots, M\}$$

that assigns each record to a hash address in a uniform manner. (The problem of designing hash functions that justify this assumption, even when the distribution of the keys is highly biased, is well-studied [7, 2].) Hashing methods differ from one another by how they resolve a collision when the hash address of the record to be inserted is already occupied.

This paper investigates the coalesced hashing algorithm, which was first published 22 years ago and is still one of the faster known searching methods [16, 7]. The total number of available storage locations is assumed to be fixed. It is also convenient to assume that these locations are contiguous in memory. For the purpose of notation, we shall number the hash table slots $1, 2, \ldots, M'$. The first $M$ slots, which serve as the range of the hash function, constitute the address region. The remaining $M' - M$ slots are devoted solely to storing records that collide when inserted; they are called the cellar. Once the cellar becomes full, subsequent colliders must be stored in empty slots in the address region and, thus, may trigger more collisions with records inserted later.

For this reason, the search performance of the coalesced hashing algorithm is very sensitive to the relative sizes of the address region and cellar. In Sec. 4, we apply the analytic results derived in [10, 11, 13] in order to optimize the ratio of their sizes, $\beta = M/M'$, which we call the address factor. The optimizations are based on two performance measures: the number of probes per search and the running time of assembly language versions. There is no unique best choice for $\beta$ (the optimum address factor depends on the type of search, the number of inserted records, and the performance measure chosen), but we shall see that the compromise choice $\beta \approx 0.86$ works well in many situations. The method can be further tuned to meet specific needs.

Section 5 shows that this tuned method dominates several popular hashing algorithms including standard coalesced hashing (in which $\beta = 1$), separate (or direct)
chaining, linear probing, and double hashing. The last three sections deal with variations and different implementations for coalesced hashing including deletion algorithms, alternative insertion methods, and external searching on secondary storage devices.

This paper is designed to provide a comprehensive treatment of the many practical issues concerned with the implementation of the coalesced hashing method. Readers interested in the theoretical justification of the results in this paper can consult [10, 11, 13, 14, 1].

2. The Coalesced Hashing Algorithm

The algorithm works like this: Given a record with key $K$, the algorithm searches for it in the hash table, starting at location hash($K$) and following the links in the chain. If the record is present in the table, then it is found and the search is successful; otherwise, the end of the chain is reached and the search is unsuccessful. For simplicity, we assume that the record is inserted whenever the search ends unsuccessfully, according to the following rule: If position hash($K$) is empty, then the record is stored at that location; else, it is placed in the largest-numbered empty slot in the table and is linked to the end of the chain. This has the effect of putting the first $M' - M$ colliders into the cellar.

Coalesced hashing is a generalization of the well-known separate (or direct) chaining method. The separate chaining method halts with overflow when there is no more room in the cellar to store a collider. The example in Fig. 1(a) can be considered to be an example of both coalesced hashing and separate chaining, because the cellar is large enough to store the three colliders.

Figures 1(b) and 1(c) show how the two methods differ. The cellar contains only one slot in the example in Fig. 1(b). When the key MARK collides with DONNA at slot 4, the cellar is already full. Separate chaining would report overflow at this point. The coalesced hashing method, however, stores the key MARK in the largest-numbered empty space (which is location 10 in the address region). This causes a later collision when DAVE hashes to position 10, so DAVE is placed in slot 8 at the end of the chain containing DONNA and MARK. The method derives its name from this "coalescing" of records with different hash addresses into single chains.

The average number of probes per search shows marked improvement in Fig. 1(b), even though coalescing has occurred. Intuitively, the larger address region spreads out the records more evenly and causes fewer collisions, i.e., the hash function can be thought of as "shooting" at a bigger target. The cellar is now too small to store these fewer colliders, so it overflows. Fortunately, this overflow occurs late in the game, and the pileup phenomenon of coalescing is not significant enough to counteract the benefits of a larger address region. However, in the extreme case when $M = M' = 11$ and there is no cellar (which we call standard coalesced hashing), coalescing begins too early and search time worsens (as typified by Fig. 1(c)). Determining the optimum address factor $\beta = M/M'$ is a major focus of this paper.

Fig. 1. Coalesced hashing, $M' = 11$, $N = 8$. The sizes of the address region are (a) $M = 8$, (b) $M = 10$, and (c) $M = 11$. Parenthesized entries lie in the cellar; "-" marks an empty slot.

    Slot   (a) M = 8    (b) M = 10   (c) M = 11
    1      JEFF         A.L.         -
    2      AUDREY       -            -
    3      -            JEFF         AUDREY
    4      DONNA        DONNA        MARK
    5      A.L.         -            AL
    6      -            AUDREY       -
    7      TOOTIE       -            DAVE
    8      -            DAVE         JEFF
    9      (DAVE)       AL           DONNA
    10     (MARK)       MARK         TOOTIE
    11     (AL)         (TOOTIE)     A.L.

    Hash addresses:
    Key    A.L.   AUDREY   AL   TOOTIE   DONNA   MARK   JEFF   DAVE
    (a)    5      2        2    7        4       5      1      2
    (b)    1      6        9    1        4       4      3      10
    (c)    11     3        5    3        10      4      10     9

    Average number of probes per successful search: (a) 12/8 = 1.5, (b) 11/8 = 1.375, (c) 14/8 = 1.75.

The first order of business before we can start a detailed study of the coalesced hashing method is to formalize the algorithm and to define reasonable measures of search performance. Let us assume that each
of the $M'$ contiguous slots in the coalesced hash table has the following organization:

    | EMPTY | KEY | other fields | LINK |

For each value of $i$ between 1 and $M'$, EMPTY[$i$] is a one-bit field that denotes whether the $i$th slot is unused, KEY[$i$] stores the key (if any), and LINK[$i$] is either the index to the next spot in the chain or else the null value 0.

The algorithms in this article are written in the English-like style used by Knuth in order to make them readily understandable to all and to facilitate comparisons with the algorithms contained in [7, 4, 12]. Block-structured languages, like PL/I and Pascal, are good for expressing complicated program modules; however, they are not used here, because hashing algorithms are so short that there is no reason to discriminate against those who are not comfortable with such languages.

Algorithm C (Coalesced hashing search and insertion). This algorithm searches an $M'$-slot hash table, looking for a given key $K$. If the search is unsuccessful and the table is not full, then $K$ is inserted.

The size of the address region is $M$; the hash function hash returns a value between 1 and $M$ (inclusive). For convenience, we make use of slot 0, which is always empty. The global variable $R$ is used to find an empty space whenever a collision must be stored in the table. Initially, the table is empty, and we have $R = M' + 1$; when an empty space is requested, $R$ is decremented until one is found. We assume that the following initializations have been made before any searches or insertions are performed: $M \leftarrow \lceil \beta M' \rceil$, for some constant $0 < \beta \le 1$; EMPTY[$i$] $\leftarrow$ true, for all $0 \le i \le M'$; and $R \leftarrow M' + 1$.

C1. [Hash.] Set $i \leftarrow$ hash($K$). (Now $1 \le i \le M$.)
C2. [Is there a chain?] If EMPTY[$i$], then go to step C6. (Otherwise, the $i$th slot is occupied, so we will look at the chain of records that starts there.)
C3. [Compare.] If $K$ = KEY[$i$], the algorithm terminates successfully.
C4. [Advance to next record.] If LINK[$i$] $\ne 0$, then set $i \leftarrow$ LINK[$i$] and go back to step C3.
C5. [Find empty slot.] (The search for $K$ in the chain was unsuccessful, so we will try to find an empty table slot to store $K$.) Decrease $R$ one or more times until EMPTY[$R$] becomes true. If $R = 0$, then there are no more empty slots, and the algorithm terminates with overflow. Otherwise, append the $R$th cell to the chain by setting LINK[$i$] $\leftarrow R$; then set $i \leftarrow R$.
C6. [Insert new record.] Set EMPTY[$i$] $\leftarrow$ false, KEY[$i$] $\leftarrow K$, LINK[$i$] $\leftarrow 0$, and initialize the other fields in the record.

In this paper, we concern ourselves with measuring the searching phase of Algorithm C and ignore for the most part the insertion time in steps C5 and C6. (The time for step C5 is not significant, because the total number of times $R$ is decremented over the course of all the insertions cannot be more than the number of inserted records; hence, the amortized expected number of decrements is at most 1. The decrementing operation can also be done in parallel with steps C1-C4.) Our primary measure of search performance is the number of probes per search, which is the number of different table slots that are accessed while searching. In Algorithm C, this quantity is equal to

$$\max\{1, \text{number of times step C3 is performed}\}$$

For example, in Fig. 1(b), the unsuccessful searches for keys A.L. and TOOTIE (immediately prior to their insertions) each took one probe, while a successful search for DAVE would take two probes.

The average performance of the algorithm is obtained by assuming that all searches and insertions are random. The Appendix contains a discussion of the probability model as well as the formulas for the expected number of probes in unsuccessful and successful searches.

3. Assembly Language Implementation

Even though probe-counting gives us a good idea of search performance, other factors (such as the complexity of the search loop and the overhead in computing the hash address) also affect the running time when Algorithm C is programmed for a real computer. For completeness, we optimize the running time of assembly language versions of coalesced hashing.

We choose to program in assembly language rather than in some high-level language like Fortran, PL/I, or Pascal, in order to achieve maximum possible efficiency. Top efficiency is important in large-scale applications of hashing, but it can also be achieved in smaller systems with little extra effort, because hashing algorithms are so short that implementing them (even in assembly language) is easy. We use a hypothetical language based on Knuth's MIX [6] because its features are similar to most well-known machines and its inherent simplicity allows us to write programs in clear and concise form.

Program C below is a MIX-like implementation of Algorithm C. Liberties have been taken with the language for purposes of clarity; the actual MIX code appears in [10]. The program is written in a five-column format: the first column gives the line numbers, the second column lists the instruction labels, the third column contains the assembly language instructions, the fourth column counts the number of times the instructions are executed, and the last column is for comments that explain what the instructions do. The syntax of the commands should be clear to those familiar with assembly language programming. The four memory registers
used in Program C are named rA, rX, rI, and rJ. The reference KEY(I) denotes the contents of the memory location whose address is the value of KEY plus the contents of rI. (This is KEY[$i$] in the notation of Algorithm C.)

Program C (Coalesced hashing search and insertion). This program follows the conventions of Algorithm C, except that the EMPTY field is implicit in the LINK field: empty slots are marked by a -1 in the LINK field of that slot. Null links are denoted by a 0 in the LINK field. The variable $R$ and the key $K$ are stored in memory locations R and K. Registers rI and rA are used to store the values of $i$ and $K$. Register rJ stores either the value of LINK[$i$] or $R$. The instruction labels SUCCESS and OVERFLOW are for exiting and are assumed to lie somewhere outside this code.

    01  START  LD   X, K          1            Step C1. Load rX with K.
    02         ENT  A, 0          1            Enter 0 into rA.
    03         DIV  =M=           1            rA <- floor(K/M), rX <- K mod M.
    04         ENT  I, X          1            Enter rX into rI.
    05         INC  I, 1          1            Increment rI by 1.
    06         LD   A, K          1            Load rA with K.
    07         LD   J, LINK(I)    1            Step C2. Load rJ with LINK[i].
    08         JN   J, STEP6      1            Jump to STEP6 if LINK[i] < 0.
    09         CMP  A, KEY(I)     A            Step C3. Compare K with KEY[i].
    10         JE   SUCCESS       A            Exit (successfully) if K = KEY[i].
    11         JZ   J, STEP5      A - S1       Jump to STEP5 if LINK[i] = 0.
    12  STEP4  ENT  I, J          C - 1        Step C4. Enter rJ into rI.
    13         CMP  A, KEY(I)     C - 1        Step C3. Compare K with KEY[i].
    14         JE   SUCCESS       C - 1        Exit (successfully) if K = KEY[i].
    15         LD   J, LINK(I)    C - 1 - S2   Load rJ with LINK[i].
    16         JNZ  J, STEP4      C - 1 - S2   Jump to STEP4 if LINK[i] != 0.
    17  STEP5  LD   J, R          A - S        Step C5. Load rJ with R.
    18         DEC  J, 1          T            Decrement R by 1.
    19         LD   X, LINK(J)    T            Load rX with LINK[R].
    20         JNN  X, *-2        T            Go back two steps if LINK[R] >= 0.
    21         JZ   J, OVERFLOW   A - S        Exit (with overflow) if R = 0.
    22         ST   J, LINK(I)    A - S        Store R in LINK[i].
    23         ENT  I, J          A - S        Enter rJ into rI.
    24         ST   J, R          A - S        Update R in memory.
    25  STEP6  ST   0, LINK(I)    1 - S        Step C6. Store 0 in LINK[i].
    26         ST   A, KEY(I)     1 - S        Store K in KEY[i].

The execution time is measured in MIX units of time, which we denote $u$. The number of time units required by an instruction is equal to the number of memory references (including the reference to the instruction itself). Hence, the LD, ST, and CMP instructions each take two units of time, while ENT, INC, DEC, and the jump instructions require only one time unit. The division operation used to compute the hash address is an exception to this rule; it takes $14u$ to execute.

The running time of a MIX program is the weighted sum

$$\sum_{\text{each instruction in the program}} \left(\text{number of times the instruction is executed}\right) \times \left(\text{number of time units required by the instruction}\right) \qquad (1)$$

This is a somewhat simplistic model, since it does not make use of cache or buffered memory for fast access of frequently used data, and since it ignores any intervention by the operating system. But it places all hashing algorithms on an equal footing and gives a good indication of relative merit.

The fourth column of Program C expresses the number of times each instruction is executed in terms of the quantities

    C = number of probes per search.
    A = 1 if the initial probe found an occupied slot, 0 otherwise.
    S = 1 if successful, 0 if unsuccessful.
    T = number of slots probed while looking for an empty space.

We further decompose $S$ into $S_1 + S_2$, where $S_1 = 1$ if the search is successful on the first probe, and $S_1 = 0$ otherwise. By formula (1), the total running time of the searching phase is

$$(7C + 4A + 17 - 3S + 2S_1)\,u \qquad (2)$$

and the insertion of a new record after an unsuccessful search (when $S = 0$) takes an additional $(8A + 4T + 4)u$. The average running time is the expected value of (2), assuming that all insertions and searches are random. The formula can be obtained by replacing the variables in Eq. (2) with their expected values.
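As a rough illustration, Algorithm C and the cost formula (2) can be sketched in Python. This is a minimal sketch under stated assumptions, not the MIX program of [10]: the class name, the function names, and the dictionary of counters are hypothetical, and the hash addresses are those read from Fig. 1 (the two entries that are illegible in this copy are inferred from the slot diagrams).

```python
class CoalescedHashTable:
    """Sketch of Algorithm C. Slots are numbered 1..total_slots; slot 0 stays empty."""

    def __init__(self, total_slots, address_slots, hash_fn):
        self.total_slots = total_slots        # M' in the text
        self.address_slots = address_slots    # M in the text
        self.hash_fn = hash_fn                # assumed to return a value in 1..M
        self.key = [None] * (total_slots + 1)
        self.link = [0] * (total_slots + 1)   # 0 is the null link
        self.empty = [True] * (total_slots + 1)
        self.r = total_slots + 1              # the global variable R

    def search_insert(self, k):
        """Search for k, inserting it if absent. Returns (found, counters)."""
        stats = {"C": 1, "A": 0, "S": 0, "S1": 0, "T": 0}
        i = self.hash_fn(k)                            # C1
        if not self.empty[i]:                          # C2
            stats["A"] = 1
            compares = 0
            while True:
                compares += 1                          # C3
                if self.key[i] == k:
                    stats.update(C=compares, S=1, S1=int(compares == 1))
                    return True, stats
                if self.link[i] == 0:                  # C4
                    break
                i = self.link[i]
            stats["C"] = compares
            while True:                                # C5
                self.r -= 1
                stats["T"] += 1
                if self.empty[self.r]:
                    break
            if self.r == 0:
                self.r = 1  # assumption: keep R valid so a later call overflows again
                raise OverflowError("no empty slots remain")
            self.link[i] = self.r
            i = self.r
        self.empty[i] = False                          # C6
        self.key[i] = k
        self.link[i] = 0
        return False, stats


def search_time_mix_units(stats):
    """Formula (2): running time of the searching phase, in MIX units u."""
    return 7 * stats["C"] + 4 * stats["A"] + 17 - 3 * stats["S"] + 2 * stats["S1"]


KEYS = ["A.L.", "AUDREY", "AL", "TOOTIE", "DONNA", "MARK", "JEFF", "DAVE"]
FIG1 = {  # panel -> (address region size M, hash addresses in key order)
    "a": (8, [5, 2, 2, 7, 4, 5, 1, 2]),
    "b": (10, [1, 6, 9, 1, 4, 4, 3, 10]),
    "c": (11, [11, 3, 5, 3, 10, 4, 10, 9]),
}


def replay_figure_1(panel, total_slots=11):
    """Insert the keys of Fig. 1, then average the probes of successful searches."""
    address_slots, hashes = FIG1[panel]
    addresses = dict(zip(KEYS, hashes))
    table = CoalescedHashTable(total_slots, address_slots, lambda k: addresses[k])
    for k in KEYS:
        table.search_insert(k)
    probes = [table.search_insert(k)[1]["C"] for k in KEYS]
    return sum(probes) / len(probes)
```

Under these assumptions, `replay_figure_1("a")`, `replay_figure_1("b")`, and `replay_figure_1("c")` give 1.5, 1.375, and 1.75, the averages quoted in Fig. 1, and a search that succeeds on its first probe costs `search_time_mix_units({"C": 1, "A": 1, "S": 1, "S1": 1, "T": 0})`, which is 27 units.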
4. Tuning $\beta$ to Obtain Optimum Performance

The purpose of the analysis in [10, 11, 13] is to show how the average-case performance of the coalesced hashing method varies as a function of the address factor $\beta = M/M'$ and the load factor $\alpha = N/M'$. In this section, for each fixed value of $\alpha$, we make use of those results in order to "tune" our choice of $\beta$ and speed up the search times. Our two measures of performance are the expected number of probes per search and the average running time of assembly language versions. In the latter case, we study a MIX implementation in detail, and then show how to apply what we learn to other assembly languages.

Unfortunately, there is no single choice of $\beta$ that yields best results: the optimum choice $\beta_{\mathrm{OPT}}$ is a function of the load factor $\alpha$ and it is even different for unsuccessful and successful searches. The section concludes with practical tips on how to initialize $\beta$. In particular, we shall see that the choice $\beta = 0.86$ works well in most situations.

4.1 Number of Probes Per Search

For each fixed value of $\alpha$, we want to find the values $\beta_{\mathrm{OPT}}$ that minimize the expected number of search probes in unsuccessful and successful searches. Formulas (A1) and (A2) in the Appendix express the average number of probes per search as a function of three variables: the load factor $\alpha = N/M'$, the address factor $\beta = M/M'$, and a new variable $\lambda = L/M$, where $L$ is the expected number of inserted records needed to make the cellar become full. The variables $\beta$ and $\lambda$ are related by the formula

$$e^{-\lambda} + \lambda = \frac{1}{\beta} \qquad (3)$$

Formulas (A1) and (A2) each have two cases, "$\alpha \le \lambda\beta$" and "$\alpha \ge \lambda\beta$," which have the following intuitive meanings: The condition $\alpha < \lambda\beta$ means that with high probability not enough records have been inserted to fill up the cellar, while the condition $\alpha > \lambda\beta$ means that enough records have been inserted to make the cellar almost surely full.

The optimum address factor $\beta_{\mathrm{OPT}}$ is always located somewhere in the "$\alpha \ge \lambda\beta$" region, as shown in the Appendix. The rest of the optimization procedure is a straightforward application of differential calculus. First, we substitute Eq. (3) into the "$\alpha \ge \lambda\beta$" cases of the formulas for the expected number of probes per search in order to express them in terms of only the two variables $\alpha$ and $\lambda$. For each nonzero fixed value of $\alpha$, the formulas are convex w.r.t. $\lambda$ and have unique minima. We minimize them by setting their derivatives equal to 0. Numerical analysis techniques are used to solve the resulting equations and to get the optimum values of $\lambda$ for several different values of $\alpha$. Then we reapply Eq. (3) to express the optimum points in terms of $\beta$. The results are graphed in Fig. 2(a), using spline interpolation to fill in the gaps.

4.2 MIX Running Times

Optimizing the MIX execution times could be tricky, in general, because the formulas might have local as well as global minima. Then when we set the derivatives equal to 0 in order to find $\beta_{\mathrm{OPT}}$, there might be several roots to the resulting equations. The crucial fact that lets us apply the same optimization techniques we used above for the number of probes is that the formulas for the MIX running times are well-behaved, as shown in the Appendix. By that we mean that each formula is minimized at a unique $\beta_{\mathrm{OPT}}$, which occurs either at the endpoint $\alpha = \lambda\beta$ or at the unique point in the "$\alpha > \lambda\beta$" region where the derivative w.r.t. $\beta$ is 0.

The optimization procedure is the same as before. The expected values of formulas (A4) and (A5), which give the MIX running times for unsuccessful and successful searches, are functions of the three variables $\alpha$, $\beta$, and $\lambda$. We substitute Eq. (3) into the expected running times in order to express $\beta$ in terms of $\lambda$. For several different load factors $\alpha$ and for each type of search, we find the value of $\lambda$ that minimizes the formula, and then we retranslate this value via Eq. (3) to get $\beta_{\mathrm{OPT}}$. Figure 2(b) graphs these optimum values $\beta_{\mathrm{OPT}}$ as a function of $\alpha$; spline interpolation was used to fill in the gaps. As in the previous section, the formulas for the average unsuccessful and successful search times yield different optimum address factors. For the successful search case, notice how closely $\beta_{\mathrm{OPT}}$ agrees with the corresponding values that minimize the expected number of probes.

Fig. 2. The values $\beta_{\mathrm{OPT}}$ that optimize search performance for the following three measures: (a) the expected number of probes per search, (b) the expected running time of Program C, and (c) the expected assembly language running time for large keys. [Plot not reproduced: $\beta_{\mathrm{OPT}}$ on the vertical axis (ticks at 0.8, 0.9, 1.0) against the load factor $\alpha$ on the horizontal axis (0 to 1.0), with separate curves for successful and unsuccessful searches.]
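Eq. (3) has no closed-form solution for $\lambda$, so in practice it is solved numerically. The following is a minimal sketch of one way to do that, assuming plain bisection (the paper does not say which numerical method was used); the function names are hypothetical.

```python
import math


def cellar_fill_point(beta, tol=1e-12):
    """Solve exp(-lam) + lam = 1/beta for lam >= 0 (Eq. (3)) by bisection.

    The left-hand side equals 1 at lam = 0 and increases with lam, so a root
    exists in [0, 1/beta] whenever 0 < beta <= 1.
    """
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must satisfy 0 < beta <= 1")
    target = 1.0 / beta
    lo, hi = 0.0, target
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if math.exp(-mid) + mid < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def cellar_almost_surely_full(alpha, beta):
    """True in the 'alpha >= lam * beta' case of formulas (A1) and (A2)."""
    return alpha >= cellar_fill_point(beta) * beta
```

For example, `cellar_fill_point(0.86)` lies between 0.62 and 0.64, so with the compromise choice $\beta = 0.86$ the cellar is expected to fill at a load factor $\alpha = \lambda\beta$ of roughly 0.54. For $\beta = 1/(1 + 1/e) \approx 0.731$ the root is $\lambda = 1$, which matches the separate chaining configuration discussed in Sec. 5.2.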
4.3 Applying the Results to Other Implementations

Our MIX analysis suggests two important principles to be used in finding $\beta_{\mathrm{OPT}}$ for a particular implementation of coalesced hashing. First, the formulas for the expected number of times each instruction in the program is executed (which are expressed for Program C in terms of $C$, $A$, $S$, $S_1$, $S_2$, and $T$) may have the two cases, "$\alpha \le \lambda\beta$" and "$\alpha \ge \lambda\beta$," but probably not more.

Second, the same optimization process as above can be used to find $\beta_{\mathrm{OPT}}$, because the formulas for the running times should be well-behaved for the following reason: The main difference between Program C and another implementation is likely to be the relative time it takes to process each key. (The keys are assumed to be very small in the MIX version.) Thus, the unsuccessful search time for another implementation might be approximately

$$[(2x + 5)C + (2x + 2)A + (-2x + 19)]\,u' \qquad (4)$$

where $u'$ is the standard unit of time on the other computer and $x$ is how many times longer it takes to process a key (multiplied by $u/u'$). Successful search times would be about

$$[(2x + 5)C + 18 + 2S_1]\,u' \qquad (5)$$

Formulas (4) and (5) were calculated by increasing the execution times of the key-processing steps 9 and 13 in Program C by a factor of $x$. (See formulas (A4) and (A5) for the $x = 1$ case.) We ignore the extra time it takes to load the larger key and to compute the hash function, since that does not affect the optimization.

The role of $C$ in formula (4) is less prevalent than in (A4) as $x$ gets large: the ratio of the coefficients of $C$ and $A$ decreases from 7/4 in (A4) and approaches the limit 2/2 = 1 in formula (4). Even in this extreme case, however, computer calculations show that the formula for the average running time is well-behaved. The values of $\beta_{\mathrm{OPT}}$ that minimize formula (4) when $x$ is large are graphed in Fig. 2(c).

For successful searches, however, the value of $C$ more strongly dominates the running times for larger values of $x$, so the limiting values of $\beta_{\mathrm{OPT}}$ in Fig. 2(c) coincide with the ones that minimize the expected number of probes per search in Fig. 2(a). Figure 2(b) shows that the approximation is close even for the case $x = 1$, which is Program C.

4.4 How to Choose $\beta$

It is important to remember that the address region size $M = \lceil \beta M' \rceil$ must be initialized when the hash table is empty and cannot change thereafter. Unfortunately, the last two sections show that each different load factor $\alpha$ requires a different optimum address factor $\beta_{\mathrm{OPT}}$; in fact, the values of $\beta_{\mathrm{OPT}}$ differ for unsuccessful and successful searches. This means that optimizing the average unsuccessful (or successful) search time for a certain load factor $\alpha$ will lead to suboptimum performance when the load factor is not equal to $\alpha$.

One strategy is to pick $\beta = 0.782$, which minimizes the expected number of probes per unsuccessful search as well as the average MIX unsuccessful search time when the table is full (i.e., load factor $\alpha = 1$), as indicated in Fig. 2. This choice of $\beta$ yields the best absolute bound on search performance, because when the table is full, search times are greatest and unsuccessful searches average slightly longer than successful ones. Regardless of the load factor, the expected number of probes per search would be at most 1.79, and the average MIX searching time would be bounded by $33.52u$.

Another strategy is to pick some compromise address factor that leads to good overall performance for a large range of load factors. A reasonable choice is $\beta = 0.86$; then the unsuccessful searches are optimized (over all other values of $\beta$) when the load factor is $\approx 0.68$ (number of probes) and $\approx 0.56$ (MIX), and the successful search performance is optimized at load factors $\approx 0.94$ (number of probes) and $\approx 0.95$ (MIX).

Figures 3 through 6 graph the expected search performance of coalesced hashing as a function of $\alpha$ for both types of searches (unsuccessful and successful) and for both measures of performance (number of probes and MIX running time). The $C_1$ curve corresponds to standard coalesced hashing (i.e., $\beta = 1$); the $C_{0.86}$ line is our compromise choice $\beta = 0.86$; and the dashed line $C_{\mathrm{OPT}}$ represents the best possible search performance that could be achieved by tuning (in which $\beta$ is optimized for each load factor).

Notice that the value $\beta = 0.86$ yields near-optimum search times once the table gets half-full, so this compromise offers a viable strategy. Of course, if some prior knowledge about the types and frequencies of the searches were available, we could tailor our choice of $\beta$ to meet those specific needs.

5. Comparisons

In this section, we compare the searching times of the coalesced hashing method with those from a representative collection of hashing schemes: standard coalesced hashing (C1), separate chaining (S), separate chaining with ordered chains (SO), linear probing (L), and double hashing (D). Implementations of the methods are given in [10]. These methods were chosen because they are the most well-known and since they each have implementations similar to that of Algorithm C. Our comparisons are based both on the expected number of probes per search as well as on the average MIX running time.

Coalesced hashing performs better than the other methods. The differences are not so dramatic with the MIX search times as with the number of probes per search, due to the large overhead in computing the hash address. However, if the keys were larger and comparisons took longer, the relative MIX savings would closely approximate the savings in number of probes.
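The way formulas (4) and (5) of Sec. 4.3 arise from Program C can be checked with a short calculation. The sketch below is an illustration under stated assumptions: it re-derives the searching-phase time from the per-instruction costs and execution counts of lines 01-16 of Program C, with the two CMP instructions made $x$ times slower, and compares the result with the closed forms. The function names are hypothetical.

```python
def searching_phase_time(c, a, s, s1, x=1):
    """Weighted sum (1) over lines 01-16 of Program C.

    c, a, s, s1 are the quantities C, A, S, S1 of the text; x scales the cost
    of the key comparisons on lines 09 and 13 (x = 1 is Program C itself).
    """
    s2 = s - s1
    cmp_cost = 2 * x
    setup = 2 + 1 + 14 + 1 + 1 + 2 + 2 + 1           # lines 01-08, once each
    first_probe = (cmp_cost + 1) * a + (a - s1)      # lines 09-10 (A times), 11 (A - S1 times)
    later_probes = (1 + cmp_cost + 1) * (c - 1)      # lines 12-14, C - 1 times each
    chain_steps = (2 + 1) * (c - 1 - s2)             # lines 15-16, C - 1 - S2 times each
    return setup + first_probe + later_probes + chain_steps


def unsuccessful_time(c, a, x=1):
    """Formula (4), in units u'."""
    return (2 * x + 5) * c + (2 * x + 2) * a + (19 - 2 * x)


def successful_time(c, s1, x=1):
    """Formula (5), in units u'."""
    return (2 * x + 5) * c + 18 + 2 * s1


def coefficient_ratio(x):
    """Ratio of the coefficients of C and A in formula (4)."""
    return (2 * x + 5) / (2 * x + 2)
```

With $x = 1$ both closed forms reduce to formula (2), and `coefficient_ratio` falls from 7/4 at $x = 1$ toward 1 as $x$ grows, which is the trend described in Sec. 4.3.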
Fig. 3. The average number of probes per unsuccessful search, as $M$ and $M' \to \infty$, for coalesced hashing ($C_1$, $C_{0.86}$, $C_{\mathrm{OPT}}$ for $\beta = 1$, $0.86$, $\beta_{\mathrm{OPT}}$), separate chaining (S), separate chaining with ordered chains (SO), linear probing (L), and double hashing (D). [Plot not reproduced: number of probes (1.0 to 2.5) against the load factor (0 to 1.0).]

Fig. 4. The average number of probes per successful search, as $M$ and $M' \to \infty$, for coalesced hashing ($C_1$, $C_{0.86}$, $C_{\mathrm{OPT}}$ for $\beta = 1$, $0.86$, $\beta_{\mathrm{OPT}}$), separate chaining (S), separate chaining with ordered chains (SO), linear probing (L), and double hashing (D). [Plot not reproduced: number of probes (1.0 to 2.5) against the load factor (0 to 1.0).]

Fig. 5. The average MIX execution time per unsuccessful search, as $M' \to \infty$, for coalesced hashing ($C_1$, $C_{0.86}$, $C_{\mathrm{OPT}}$ for $\beta = 1$, $0.86$, $\beta_{\mathrm{OPT}}$), separate chaining (S), separate chaining with ordered chains (SO), linear probing (L), and double hashing (D). [Plot not reproduced: MIX time units (20 to 40) against the load factor $\alpha$ (0 to 1.0).]

Fig. 6. The average MIX execution time per successful search, as $M' \to \infty$, for coalesced hashing ($C_1$, $C_{0.86}$, $C_{\mathrm{OPT}}$ for $\beta = 1$, $0.86$, $\beta_{\mathrm{OPT}}$), separate chaining (S), separate chaining with ordered chains (SO), linear probing (L), and double hashing (D). [Plot not reproduced: MIX time units (20 to 40) against the load factor $\alpha$ (0 to 1.0).]

5.1 Standard Coalesced Hashing (C1)

Standard coalesced hashing is the special case of coalesced hashing for which $\beta = 1$ and there is no cellar. This is obviously the most realistic comparison that can be made, because except for the initialization of the address region size, standard coalesced hashing and "tuned" coalesced hashing are identical. Figures 3 and 4 show that the savings in number of probes per search can be as much as 14 percent (unsuccessful) and 6 percent (successful). In Figs. 5 and 6, the corresponding savings in MIX searching time is 6 percent (unsuccessful) and 2 percent (successful).
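The comparison in Sec. 5.1 can also be explored empirically. The following is a minimal simulation sketch, not the analysis of [10, 11, 13]: it assumes uniformly random hash addresses (standing in for the random-insertion model), fills the table completely, and uses an arbitrary table size, trial count, and seed. All names are hypothetical.

```python
import math
import random


def average_successful_probes(total_slots, beta, rng):
    """Fill an M'-slot coalesced hash table with random hash addresses and
    return the average number of probes per successful search.

    Chains only grow at their ends, so the number of probes needed to find a
    record is fixed at the moment the record is inserted.
    """
    address_slots = math.ceil(beta * total_slots)   # M = ceil(beta * M')
    link = [0] * (total_slots + 1)                  # 0 is the null link
    occupied = [False] * (total_slots + 1)
    r = total_slots + 1                             # the variable R of Algorithm C
    total_probes = 0
    for _ in range(total_slots):
        i = rng.randint(1, address_slots)           # random hash address in 1..M
        probes = 1
        if occupied[i]:
            while link[i] != 0:                     # walk to the end of the chain
                i = link[i]
                probes += 1
            r -= 1
            while occupied[r]:                      # largest-numbered empty slot
                r -= 1
            link[i] = r
            i = r
            probes += 1
        occupied[i] = True
        total_probes += probes
    return total_probes / total_slots


def compare_standard_and_tuned(total_slots=4000, trials=5, seed=1982):
    """Average over several full tables for beta = 1 and beta = 0.86."""
    rng = random.Random(seed)
    results = {}
    for beta in (1.0, 0.86):
        runs = [average_successful_probes(total_slots, beta, rng) for _ in range(trials)]
        results[beta] = sum(runs) / trials
    return results
```

Calling `compare_standard_and_tuned()` returns the two averages keyed by $\beta$. With a full table one would expect the $\beta = 0.86$ figure to come out below the $\beta = 1$ figure, in line with the successful-search savings reported for Fig. 4, although any single run is subject to sampling noise.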
  8. 8. 5.2 Separate (or Direct) Chaining (S)

The separate chaining method is given an unfair advantage in Figs. 3 and 4: the number of probes per search is graphed as a function of α̃ = N/M rather than α = N/M′, and so does not take into account the number of auxiliary slots used to store colliders. In order to make the comparison fair, we must adjust the load factor accordingly.

Separate chaining implementations are often designed to accommodate about N = M records; an average of M(1 − 1/M)^M ≈ M/e auxiliary slots are needed to store the colliders. The total table size is thus M′ = M + M/e. Solving backwards for M, we get M = 0.731M′. In other words, we may consider separate chaining to be the special case of coalesced hashing for which β ≈ 0.731, except that no more records can be inserted once the cellar overflows. Hence, the adjusted load factor is α = 0.731α̃, and overflow occurs when there are around N = M = 0.731M′ inserted records. (This is a reasonable space/time compromise: if we make M smaller, then more records can usually be stored before overflow occurs, but the average search times blow up; if we increase M to get better search times, then overflow occurs much sooner, and many slots are wasted.)

If we adjust the load factors in Figs. 3 and 4 in this way, Algorithm C generates better search statistics: the expected number of probes per search for separate chaining is ≈ 1.37 (unsuccessful) and ≈ 1.5 (successful) when the load factor α̃ is 1, while that for coalesced hashing is ≈ 1.32 (unsuccessful) and ≈ 1.44 (successful) when the load factor α = βα̃ is equal to 0.731.

The graphs in Figs. 5 and 6 already reflect this load factor adjustment. In fact, the MIX implementation of separate chaining (Program S in [10]) is identical to Program C, except that β is initialized to 0.731 and overflow is signaled automatically when the cellar runs out of empty slots. Program C is slightly quicker in MIX execution time than Program S, but more importantly, the coalesced hashing implementation is more space efficient: Program S usually overflows when α = 0.731, while Program C can always obtain full storage utilization α = 1. This confirms our intuition that coalesced hashing can accommodate more records than the separate chaining method and still outperform separate chaining before that method overflows.

5.3 Separate Chaining with Ordered Chains (SO)

This method is a variation of separate chaining in which the chains are kept ordered by key value. The expected number of probes per successful search does not change, but unsuccessful searches are slightly quicker, because only about half the chain needs to be searched, on the average.

Our remarks about adjusting the load factor in Figs. 3 and 4 also apply to method SO. But even after that is done, the average number of probes per unsuccessful search as well as the expected MIX unsuccessful search time is slightly better for this method than for coalesced hashing. However, as Fig. 6 illustrates, the average successful search time of Program SO is worse than Program C's, and in real-life situations, the difference is likely to be more apparent, because records that are inserted first tend to be looked up more often and should be kept near the beginning of the chain, not rearranged.

Method SO has the same storage limitations as the separate chaining scheme (i.e., the table usually overflows when N = M = 0.731M′), whereas coalesced hashing can obtain full storage utilization.

5.4 Linear Probing (L) and Double Hashing (D)

When searching for a record with key K, the linear probing method first checks location hash(K), and if another record is already there, it steps cyclically through the table, starting at location hash(K), until the record is found (successful search) or an empty slot is reached (unsuccessful search). Insertions are done by placing the record into the empty slot that terminated the unsuccessful search. Double hashing generalizes this by letting the cyclic step size be a function of K.

We have to adjust the load factor in the opposite direction when we compare Algorithm C with methods L and D, because the latter do not require LINK fields. For example, if we suppose that the LINK field comprises 1/4 of the total record size in a coalesced hashing implementation, then the same storage holds 4/3 as many slots without LINK fields, so the search statistics in Figs. 3 and 4 for Algorithm C with load factor α should be compared against those for linear probing and double hashing with load factor (3/4)α. In this case, the average number of probes per search is still better for coalesced hashing.

However, the LINK field is often much smaller than the rest of the record, and sometimes it can be included in the table at virtually no extra cost. The MIX implementation Program C in [10] assumes that the LINK field can be squeezed into the record without need of extra storage space. Figures 5 and 6, therefore, require no load factor adjustment.

To balance matters, the MIX implementations of linear probing and double hashing, which are given in [10] and [7], contain two code optimizations. First, since LINK fields are not used in methods L and D, we no longer need 0 to denote a null LINK, and we can renumber the table slots from 0 to M′ − 1; the hash function now returns a value between 0 and M′ − 1. This makes the hash address computation faster by 1u (one MIX time unit), because the instruction INC1 1 can be eliminated. Second, the empty slots are denoted by the value 0 in order to make the comparisons in the inner loop as fast as possible. This means that records are not allowed to have a key value of 0. The final results are graphed in Figs. 5 and 6. Coalesced hashing clearly dominates when the load factor is greater than 0.6.

6. Deletions

It is often useful in hashing applications to be able to delete records when they no longer logically belong to the set of objects being represented in the hash table. For
  9. 9. example, in an airlines reservations system, passenger records are often expunged soon after the flight has taken place.

One possible deletion strategy often used for linear probing and double hashing is to include a special one-bit DELETED field in each record that says whether or not the record has been deleted. The search algorithm must be modified to treat each "deleted" table slot as if it were occupied by a null record, even though the entire record is still there. This is especially desirable when there are pointers to the records from outside the table.

If there are no such external pointers to worry about, the "deleted" table slots can be reused for later insertions: whenever an empty slot is needed in step C5 of Algorithm C, the record is inserted into the first "deleted" slot encountered during the unsuccessful search; if there is no such slot, an empty slot is allocated in the usual way. However, a certain percentage of the "deleted" slots probably will remain unused, thus preventing full storage utilization. Also, insertions and deletions over a prolonged period would cause the expected search times to approximate those for a full table, regardless of the number of undeleted records, because the "deleted" records make the searches longer.

If we are willing to spend a little extra time per deletion, we can do without the DELETED field by relocating some of the records that follow in the chain. The basic idea is this: First, we find the record we want to delete, mark its table slot empty, and set the LINK field of its predecessor (if any) to the null value 0. Then we use Algorithm C to reinsert each record in the remainder of the chain, but whenever an empty slot is needed in step C5, we use the position that the record already occupies.

This method can be illustrated by deleting AL from location 10 in Fig. 7(a); the end result is pictured in Fig. 7(b). The first step is to create a hole in position 10 where AL was, and to set AUDREY's LINK field to 0. Then we process the remainder of the chain. The next record TOOTIE rehashes to the hole in location 10, so TOOTIE moves up to plug the hole, leaving a new hole in position 9. Next, DONNA collides with AUDREY during rehashing, so DONNA remains in slot 8 and is linked to AUDREY. Then MARK also collides with AUDREY; we leave MARK in position 7 and link it to DONNA, which was formerly at the end of AUDREY's hash chain. The record JEFF rehashes to the hole in slot 9, so we move it up to plug the hole, and a new hole appears in position 6. Finally, DAVE rehashes to position 9 and joins JEFF's chain.

Location 6 is the current hole position when the deletion algorithm terminates, so we set EMPTY[6] ← true and return it to the pool of empty slots. However, the value of R in Algorithm C is already 5, so step C5 will never try to reuse location 6 when an empty slot is needed.

We can solve this problem by using an available-space list in step C5 rather than the variable R; the list must be doubly linked so that a slot can be removed quickly from the list in step C6. The available-space list does not require any extra space per table slot, since we can use the KEY and LINK fields of the empty slots for the two pointer fields. (The KEY field is much larger than the LINK field in typical implementations.) For clarity, we rename the two pointer fields NEXT and PREV. Slot 0 in the table acts as the dummy start of the available-space list, so NEXT[0] points to the first actual slot in the list and PREV[0] points to the last. Before any records are inserted into the table, the following extra initializations must be made: NEXT[0] ← M′, PREV[M′] ← 0; and NEXT[i] ← i − 1 and PREV[i − 1] ← i, for 1 ≤ i ≤ M′. We replace steps C5 and C6 by

C5. [Find empty slot.] (The search for K in the chain was unsuccessful, so we will try to find an empty table slot to store K.) If the table is already full (i.e., NEXT[0] = 0), the algorithm terminates with overflow. Otherwise, set LINK[i] ← NEXT[0] and i ← NEXT[0].

C6. [Insert new record.] Remove the ith slot from the

Fig. 7. (a) Inserting the eight records; (b) Inserting all the records except AL. (The link arrows of the original figure were lost in extraction.)

Slot   (a)       (b)
 1     AUDREY    AUDREY
 2
 3
 4
 5     DAVE      DAVE
 6     JEFF
 7     MARK      MARK
 8     DONNA     DONNA
 9     TOOTIE    JEFF
10     AL        TOOTIE
11     A.L.      A.L.

Keys:            A.L.  AUDREY  AL  TOOTIE  DONNA  MARK  JEFF  DAVE
Hash Addresses:  11    1       1   10      1      1     9     9
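The relocation strategy walked through with Fig. 7 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the MIX program referenced in the text: the names `HASH`, `fig7a`, and `delete` are hypothetical, a lookup table stands in for a real hash function, empty slots are marked with `None` instead of being kept on an available-space list, and the starting links are assumed to be the ones late insertion produces for the keys in the order listed under Fig. 7.

```python
# Slot 0 is unused so that 0 can serve as the null link, as in the text.

# Hash addresses listed under Fig. 7 (assumption: a dict replaces hash()).
HASH = {"A.L.": 11, "AUDREY": 1, "AL": 1, "TOOTIE": 10,
        "DONNA": 1, "MARK": 1, "JEFF": 9, "DAVE": 9}


def fig7a():
    """Return (key, link) arrays for the table of Fig. 7(a)."""
    key = [None] * 12
    link = [0] * 12
    layout = {1: "AUDREY", 5: "DAVE", 6: "JEFF", 7: "MARK",
              8: "DONNA", 9: "TOOTIE", 10: "AL", 11: "A.L."}
    for slot, name in layout.items():
        key[slot] = name
    # Assumed links: one coalesced chain 1 -> 10 -> 9 -> 8 -> 7 -> 6 -> 5.
    for src, dst in [(1, 10), (10, 9), (9, 8), (8, 7), (7, 6), (6, 5)]:
        link[src] = dst
    return key, link


def delete(key, link, k):
    """Delete k and rehash its successors in place.

    Returns the slot left empty, or 0 if k is not in the table.
    """
    # Find k, remembering its predecessor in the chain.
    i, pred = HASH[k], 0
    while i != 0 and key[i] != k:
        pred, i = i, link[i]
    if i == 0:
        return 0
    if pred:
        link[pred] = 0              # cut the chain in front of k
    hole = i
    i, link[hole] = link[hole], 0   # i walks through the successors of k
    while i != 0:
        nxt, link[i] = link[i], 0
        j = HASH[key[i]]
        if j == hole:
            key[hole] = key[i]      # the record moves up to plug the hole
            hole = i
        else:
            while link[j] != 0:     # otherwise it stays put and is linked
                j = link[j]         # to the end of its own hash chain
            link[j] = i
        i = nxt
    key[hole] = None                # the final hole becomes an empty slot
    return hole


key, link = fig7a()
print(delete(key, link, "AL"), key[1:])
```

Deleting AL from the Fig. 7(a) table with this sketch leaves slot 6 empty and produces the layout listed for Fig. 7(b). The tombstone approach described earlier would need only a flag per slot, so this version trades extra work per deletion for chains that stay short.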
  10. 10. available-space list by setting PREV[NEXT[i]] ← PREV[i] and NEXT[PREV[i]] ← NEXT[i]. Then set EMPTY[i] ← false, KEY[i] ← K, LINK[i] ← 0, and initialize the other fields in the record.

The following deletion algorithm is analyzed in detail in [10] and [14].

Algorithm CD (Deletion with coalesced hashing). This algorithm deletes the record with key K from a coalesced hash table constructed by Algorithm C, with steps C5 and C6 modified as above.

This algorithm preserves the important invariant that K is stored at its hash address if and only if it is at the start of its chain. This makes searching for K's predecessor in the chain easy: if it exists, then it must come at or after position hash(K) in the chain.

CD1. [Search for K.] Set i ← hash(K). If EMPTY[i], then K is not present in the table and the algorithm terminates. Otherwise, if K = KEY[i], then K is at the start of the chain, so go to step CD3.

CD2. [Split chain in two.] (K is not at the start of its chain.) Repeatedly set PRED ← i and i ← LINK[i] until either i = 0 or K = KEY[i]. If i = 0, then K is not present in the table, and the algorithm terminates. Else, set LINK[PRED] ← 0.

CD3. [Process remainder of chain.] (Variable i will walk through the successors of K in the chain.) Set HOLE ← i, i ← LINK[i], LINK[HOLE] ← 0. Do step CD4 zero or more times until i = 0. Then go to step CD5.

CD4. [Rehash record in ith slot.] Set j ← hash(KEY[i]). If j = HOLE, we move up the record to plug the hole by setting KEY[HOLE] ← KEY[i] and HOLE ← i. Otherwise, we link the record to the end of its hash chain by doing the following: set j ← LINK[j] zero or more times until LINK[j] = 0; then set LINK[j] ← i. Set k ← LINK[i], LINK[i] ← 0, and i ← k. Repeat step CD4 unless i = 0.

CD5. [Mark slot HOLE empty.] Set EMPTY[HOLE] ← true. Place HOLE at the start of the available-space list by setting NEXT[HOLE] ← NEXT[0], PREV[HOLE] ← 0, PREV[NEXT[0]] ← HOLE, NEXT[0] ← HOLE. ∎

Algorithm CD has the important property that it preserves randomness for the special case of standard coalesced hashing (when M = M′), in that deleting a record is in some sense like never having inserted it. The "sense" is strong enough so that the formulas for the average search times are still valid after deletions are performed. Exactly what preserving randomness means is explained in detail in [14].

We can speed up the rehashing phase in the latter half of step CD4 by linking the record into the chain immediately after its hash address rather than at the end of the chain. When this modified deletion algorithm is called on a random standard coalesced hash table, the resulting table is better-than-random: the average search times after N random insertions and one deletion are sometimes better (and never worse) than they would be with N − 1 random insertions alone. Whether or not this remains true after more than one deletion is an open problem.

If this deletion algorithm is used when there is a cellar (i.e., β < 1), we can modify it so that whenever a hole appears in the cellar during the execution of Algorithm CD, the next noncellar record in the chain moves up to plug the hole. Unfortunately, even with this modification, the algorithm does not break up chains well enough to preserve randomness. It seems possible that search performance may remain very good anyway. Analytic and empirical study is needed to determine just "how far from random" the search times get after deletions are performed.

Two remarks should be made about implementing this modified deletion algorithm. First, in step CD5, the empty slot should be returned to the start of the available-space list when the slot is in the cellar; otherwise, it should be placed at the end. This has the effect of giving cellar slots higher priority on the available-space list. Second, if a cellar slot is freed by a deletion and then reallocated during a later insertion, it is possible for a chain to go in and out of the cellar more than once. Programmers should no longer assume that a chain's cellar slots immediately follow the start of the chain.

7. Implementations and Variations

Most important searching algorithms have several different implementations in order to handle a variety of applications; coalesced hashing is no exception. We have already discussed some modifications in the last section in connection with deletion algorithms. In particular, we needed to use a doubly linked available-space list so that the empty slots could be added and removed quickly. Thus, the cellar need not be contiguous. Another strategy to handle a noncontiguous cellar is to link all the table slots together initially and to replace "Decrease R" in step C5 of Algorithm C with "Set R ← LINK[R]." With either modification, Algorithm C can simulate the separate chaining method until the cellar empties; subsequent colliders can be stored in the address region as usual. Hence, coalesced hashing can have the benefit of dynamic allocation as well as total storage utilization.

Another common data structure is to store pointers to the fields, rather than the fields themselves, in the table slots. For example, if the records are large, we might want to store only the key and link values in each slot, along with a pointer to where the rest of the record is located. We expand upon this idea later in this section.

If we are willing to do extra work during insertion and if the records are not pointed to from outside the table, we can modify the insertion algorithm to prevent the chains from coalescing: When a record R1 collides during insertion with another record R2 that is not at the
  11. 11. start of the chain, we store R1 at its hash address and relocate R2 to some other spot. (The LINK field of R2's predecessor must be updated.) The size of the records should not be very large or else the cost of rearrangement might get prohibitive. There is an alternate strategy that prevents coalescing and does not relocate records, but it requires an extra link field per slot and the searches are slightly longer. One link field is used to chain together all the records with the same hash address. The other link field contains for slot i a pointer to the start of the chain of records with hash address i. Much of the space for the link fields is wasted, and chains may start one link away from their hash address. Resources could be put to better use by using coalesced hashing.

This section is devoted to the more nonobvious implementations of coalesced hashing. First, we describe alternate insertion strategies and then conclude with three applications to external searching on secondary storage devices. A scheme that allows the coalesced hash table to share memory with other data structures can be found in [12]. A generalization of coalesced hashing that uses nonuniform hash functions is described in [13].

7.1 Early-Insertion and Varied-Insertion Coalesced Hashing

If we know a priori that a record is not already present in the table, then it is not necessary in Algorithm C to search to the end of the chain before the record is inserted: If the hash address location is empty, the record can be inserted there; otherwise, we can link the record into the chain immediately after its hash address by rerouting pointers. We call this the early-insertion method because the collider is linked "early" in the chain, rather than at the end. We will refer to the unmodified algorithm (Algorithm C in Sec. 2) as the late-insertion method.

Early-insertion can be used even if we do not have a priori knowledge about the record's presence, in which case the entire chain must be searched in order to verify that the record is not already stored in the table. We can implement this form of early-insertion by making the following two modifications to Algorithm C. First, we add the assignment "Set j ← i" at the end of step C2, so that j stores the hash address hash(K). The second modification replaces the last sentence of step C5 by "Otherwise, link the Rth cell into the chain immediately after the hash address j by setting LINK[R] ← LINK[j], LINK[j] ← R; then set i ← R."

Each chain of records formed using early-insertion contains the same records as the corresponding chain formed by late-insertion. Since the length of a random unsuccessful search depends only on the number of records in the chain between the hash address and the end of the chain, and since all the records are in the address region when there is no cellar, it must be true that the average number of probes per unsuccessful search is the same for the two methods if there is no cellar. However, the order of the records within each chain may be different for early-insertion than for late-insertion. When there is no cellar, the early-insertion algorithm causes the records to align themselves in the chains closer to their hash addresses, on the average, than would be the case with late-insertion, so the expected successful search times are better.

A typical case is illustrated in Fig. 8. The record DAVE collides with A.L. at slot 5. In Fig. 8(a), which uses late-insertion, DAVE is linked to the end of the chain containing A.L., whereas if we use early-insertion as in Fig. 8(b),

Fig. 8. Standard Coalesced Hashing, M′ = M = 11, N = 8. (a) Late-insertion; (b) Early-insertion. (The link arrows of the original figure were lost in extraction; both panels use the same slot layout, with address size 11.)

Slot   (a) late-insertion   (b) early-insertion
 1     AUDREY               AUDREY
 2
 3     DONNA                DONNA
 4     JEFF                 JEFF
 5     A.L.                 A.L.
 6
 7
 8     DAVE                 DAVE
 9     MARK                 MARK
10     TOOTIE               TOOTIE
11     AL                   AL

Keys:            A.L.  AUDREY  AL  TOOTIE  DONNA  MARK  JEFF  DAVE
Hash Addresses:  5     1       5   10      3      11    4     5

Average number of probes per successful search: (a) 13/8 ≈ 1.63, (b) 12/8 = 1.5.
  12. 12. DAVE is linked into the chain at the point between A.L. and AL. The average successful search time in Fig. 8(b) is slightly better than in Fig. 8(a), because linking DAVE into the chain immediately after A.L. (rather than at the end of the chain) reduces the search time for DAVE from four probes to two and increases the search time for AL from two probes to three. The result is a net decrease of one probe.

One can show easily that this effect manifests itself only on chains of length greater than 3, so there is little improvement when the load factor α is small, since the chains are usually short. Recent theoretical results show that the average number of probes per successful search is 5 percent better with early-insertion than with late-insertion when there is no cellar and the table is full (i.e., α = 1), but is only 0.5 percent better when α = 0.5 [1, 5]. A possible disadvantage of early-insertion is that earlier colliders tend to be shoved to the rear by later ones, which may not be desirable in some practical situations when the records inserted first tend to be accessed more often than those inserted later. Nevertheless, early-insertion is an improvement over late-insertion when there is no cellar.

When there is a cellar, preliminary studies indicate that search performance is probably worse with early-insertion than with Algorithm C, because a chain's records that are in the cellar now come at the end of the chain, whereas with late-insertion they come immediately after the start. In the example in Fig. 9(b), the insertion of JEFF causes both cellar records AL and TOOTIE to move one link further from their hash addresses. That does not happen with late-insertion in Fig. 9(a).

We shall now introduce a new variant, called varied-insertion, that can be shown to be better than both the late-insertion and early-insertion methods when there is a cellar. When there is no cellar, varied-insertion is identical to early-insertion. In the varied-insertion method, the early-insertion strategy is used except when the cellar is full and the hash address of the inserted record is the start of a chain that has records in the cellar. In that case, the record is linked into the chain immediately after the last cellar slot in the chain.

Figure 9(c) shows a typical hash table constructed using varied-insertion. The cellar is already full when the record DAVE is inserted. The hash address of DAVE is 1, which is at the start of a chain that has records in the cellar. Therefore, early-insertion is not used, and DAVE is instead linked into the chain immediately after AL, which is the last record in the chain that is in the cellar. The average number of probes per search is better for varied-insertion than for both late-insertion and early-insertion.

The varied-insertion method incorporates the advantages of early-insertion, but without any of the drawbacks described three paragraphs earlier. The records of a chain that are in the cellar always come immediately after the start of the chain. The average number of probes per search for varied-insertion is always less than or equal to that for late-insertion and early-insertion. For unsuccessful searches, the expected number of probes for varied-insertion and late-insertion are identical.

Research is currently underway to determine the average search times for the varied-insertion method, as well as to find the values of the optimum address factor β_OPT. We expect that the initialization β ← 0.86 will be preferred in most situations, as it is for late-insertion. The resulting search times for varied-insertion should be a slight improvement over late-insertion.

The idea of linking the inserted record into the chain immediately after its hash address has been incorporated into the first modification of Algorithm CD in the last

Fig. 9. Coalesced Hashing, M′ = 11, M = 9, N = 8. (a) Late-insertion; (b) Early-insertion; and (c) Varied-insertion. (The link arrows of the original figure were lost in extraction; all three panels use the same slot layout, with address size 9 and cellar slots shown in parentheses.)

Slot    Record
 1      A.L.
 2
 3      AUDREY
 4
 5
 6      DAVE
 7      JEFF
 8      MARK
 9      DONNA
(10)    TOOTIE
(11)    AL

Keys:            A.L.  AUDREY  AL  TOOTIE  DONNA  MARK  JEFF  DAVE
Hash Addresses:  1     3       1   1       3      1     8     1

Average number of probes per unsuccessful search: (a) 18/9 = 2.0, (b) 24/9 ≈ 2.67, (c) 18/9 = 2.0.
Average number of probes per successful search: (a) 21/8 ≈ 2.63, (b) 22/8 = 2.75, (c) 20/8 = 2.5.
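The three insertion strategies compared in Fig. 9 differ only in where the newly allocated slot is spliced into the chain, which the following Python sketch tries to make concrete. It is an illustration under stated assumptions rather than the paper's implementation: the names `build`, `successful_probes`, `unsuccessful_probes`, and `FIG9` are hypothetical, hash addresses are supplied with the keys instead of being computed, keys are assumed to be distinct, there are no deletions (so "the cellar is full" is detected by the new slot falling in the address region), and an unsuccessful search that starts at an empty slot is counted as one probe.

```python
def build(records, table_size, address_size, strategy):
    """Insert (key, hash_address) pairs; return 1-indexed (key, link) arrays."""
    key = [None] * (table_size + 1)
    link = [0] * (table_size + 1)        # 0 is the null link
    r = table_size + 1                   # plays the role of R in Algorithm C
    for k, h in records:
        if key[h] is None:               # hash address is free: store there
            key[h] = k
            continue
        # Walk the chain to find its end and its last cellar slot (if any).
        tail, last_cellar, i = h, 0, h
        while i != 0:
            tail = i
            if i > address_size:
                last_cellar = i
            i = link[i]
        # Decrease R until an empty slot is found.
        r -= 1
        while r > 0 and key[r] is not None:
            r -= 1
        if r == 0:
            raise OverflowError("table is full")
        key[r] = k
        if strategy == "late":
            after = tail                 # link at the end of the chain
        elif strategy == "varied" and r <= address_size and last_cellar:
            after = last_cellar          # cellar full: link after last cellar slot
        else:
            after = h                    # early: link right after hash address
        link[r] = link[after]
        link[after] = r
    return key, link


def successful_probes(key, link, records):
    """Total probes needed to find every stored key once."""
    total = 0
    for k, h in records:
        i, probes = h, 1
        while key[i] != k:
            i, probes = link[i], probes + 1
        total += probes
    return total


def unsuccessful_probes(key, link, address_size):
    """Total probes over one unsuccessful search per address-region slot."""
    total = 0
    for h in range(1, address_size + 1):
        i, probes = link[h], 1
        while i != 0:
            i, probes = link[i], probes + 1
        total += probes
    return total


FIG9 = list(zip(["A.L.", "AUDREY", "AL", "TOOTIE", "DONNA", "MARK", "JEFF", "DAVE"],
                [1, 3, 1, 1, 3, 1, 8, 1]))

for strategy in ("late", "early", "varied"):
    key, link = build(FIG9, 11, 9, strategy)
    print(strategy, successful_probes(key, link, FIG9),
          unsuccessful_probes(key, link, 9))
```

Run on the keys of Fig. 9, the sketch gives successful-search totals of 21, 22, and 20 probes and unsuccessful-search totals of 18, 24, and 18 probes for late-, early-, and varied-insertion, which agree with the averages quoted under the figure. With no cellar (`address_size == table_size`) the varied branch never triggers, so varied-insertion reduces to early-insertion as the text states.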
  13. 13. section. It is natural to ask whether the modified deletion algorithm would preserve randomness for the modified insertion algorithms presented in this section. The answer is no, but it is possible that the deletion algorithm could make the table better-than-random, as discussed at the end of the last section. Finding good deletion algorithms for early-insertion and varied-insertion as well as for late-insertion is a challenging problem.

7.2 Coalesced Hashing with Buckets

Hashing is used extensively in database applications and file systems, where the hash table is too large to fit entirely in main memory and must be stored on external devices, like disks and drums. The hash table is sectioned off into blocks (or pages), each block containing b records; transfers to and from main memory take place a block at a time. Searching time is dominated by the block transfer rate; now the object is to minimize the expected number of block accesses per search.

Operating systems with a virtual memory environment are designed to break up data structures into blocks automatically, even though it appears to the programmer that his data structures all reside in main memory. Linear probing (see Sec. 5) is often the best hashing scheme to use in this environment, because successive probes occur in contiguous locations and are apt to be in the same block. Thus, one or two block accesses are usually sufficient for lookup.

We can do better if we know beforehand where the block divisions occur. We treat each block as a large table slot or bucket that can store b records. Let M be the total number of buckets. The following modification of Algorithm C appears in [7].

To process a record with key K, we search for it in the chain of buckets, starting at bucket hash(K). After an unsuccessful search, we insert the record into the last bucket in the chain if there is room, or else we store it in some nonfull bucket and link that bucket to the end of the chain. We can speed up this last part by maintaining a doubly linked circular list of nonfull buckets, with a "roving pointer" marking one of the buckets. Each time we need another nonfull bucket to store a collider, we insert the record into the bucket indicated by the roving pointer, and then we reset the roving pointer to the next bucket on the list. This helps distribute the records evenly, because different chains will use different buckets (at least until we make one loop through the available-bucket list). When the external device is a disk, block accesses are faster when they occur on the same cylinder, so we should keep a separate available-bucket list for each cylinder.

Record size varies from application to application, but for purposes of illustration, we use the following parameters: the block size B is 4000 bytes; the total record size R is 400 bytes, of which the key comprises 7 bytes. The bucket size b is approximately B/R = 10. When the size of the bucket is that small, searching in each bucket can be done sequentially; there is no need for the record size to be fixed, as long as each record is preceded by its length (in bytes).

Deletions can be done in one of several ways, analogous to the different methods discussed in the last section. In some cases, it is best merely to mark the record as "deleted," because there may be pointers to the record from somewhere outside the hash table, and reusing the space could cause problems. Besides, many large scale database systems undergo periodic reorganization during low-peak hours, in which the entire table (minus the deleted records) is reconstructed from scratch [15]. This method has not been analyzed analytically, but it seems to have great potential.

7.3 Hash Tables Within a Hash Table

When the record size R is small compared to the block size B, the resulting bucket size b ≈ B/R is relatively large. Sequential search through the blocks is now too slow. (The block transfer rate no longer dominates search times.) Other methods should be used to organize the records within blocks.

This is especially true with multiattribute indexing, in which we can look up records via one of several different keys. For example, a large university database may allow a student's record to be accessed by specifying either his name, social security number, student I.D., or bank account number. In this case, four hash tables are used. Instead of storing all the records in four different tables, we let the four tables share a single copy of the records. Each hash table entry consists of only the key value, the link field, and a pointer to the rest of the student record (which is stored in some other block). Lookup now requires one extra block access. Continuing our numerical example, the table record size reduces from R = 400 bytes to about R = 12 bytes, since the key occupies 7 bytes, and the two pointer fields presumably can be squeezed into the remaining 5 bytes. The bucket size b is now about B/R ≈ 333.

In such cases where b is rather large and searching within a bucket can get expensive, it pays to organize each bucket as a hash table. The hash function must be modified to return a binary number at least ⌈log M⌉ + ⌈log b⌉ bits in length; the high-order bits of the hash address specify one of the M buckets (or blocks), and the low-order bits specify one of the b record positions within that bucket. Coalesced hashing is a natural method to use because the bucket size (in this example, b = 333) imposes a definite constraint on the number of records that may be stored in a block, so it is reasonable to try to optimize the amount of space devoted to the address region versus the amount of space devoted to the cellar.

7.4 Dynamic Hashing

So far we have not addressed the problem of what to do when overflow occurs, that is, when we want to insert more records into a hash table that is already full. The common technique is to place the extra records into an auxiliary storage pool and link them to the main table. Search performance remains tolerable as long as the number of insertions after overflow does not get too large. (Guibas [4] analyzes this for the special case of standard coalesced
  14. 14. hashing.) Later during the off-hours when the system is not heavily used, a larger table is allocated and the records are reinserted into the new table.

This strategy is not viable when database utilization is relatively constant with time. Several similar methods, known loosely as dynamic hashing, have been devised that allow the table size to grow and shrink dynamically with little overhead [3, 8, 9]. When the load factor gets too high or when buckets overflow, the hash table grows larger and certain buckets are split, thereby reducing the congestion. If the bucket size is rather large, for example, if we allow multiattribute accessing, then coalesced hashing can be used to organize the records within a block, as explained above, thus combining this technique with coalesced hashing in a truly dynamic way.

8. Conclusions

Coalesced hashing is a conceptually elegant and extremely fast method for information storage and retrieval. This paper has examined in detail several practical issues concerning the implementation of the method. The analysis and programming techniques presented here should allow the reader to determine whether coalesced hashing is the method of choice in any given situation, and if so, to implement an efficient version of the algorithm.

The most important issue addressed in this paper is the initialization of the address factor β. The intricate optimization process discussed in Sec. 4 and the Appendix can in principle be applied to any implementation of coalesced hashing. Fortunately, there is no need to undertake such a computational burden for each application, because the results presented in this paper apply to most reasonable implementations. The initialization β ← 0.86 is recommended in most cases, because it gives near-optimum search performance for a wide range of load factors. The graph in Fig. 2 makes it possible to fine-tune the choice of β, in case some prior knowledge about the types and frequencies of the searches is available.

The comparisons in Sec. 5 show that the tuned coalesced hashing algorithm outperforms several popular hashing methods when the load factor is greater than 0.6. The differences are more pronounced for large records. The inner search loop in Algorithm C is very short and simple, which is important for practical implementations. Coalesced hashing has the advantage over other chaining methods that it uses only one link field per slot and can achieve full storage utilization. The method is especially suited for applications with a constrained amount of memory or with the requirement that the records cannot be relocated after they are inserted.

In applications where deletions are necessary, one of the strategies described in Sec. 6 should work well in practice. However, research remains to be done in several areas including the analysis of the current deletion algorithms and the design of new strategies that hopefully will preserve randomness. The variant methods in Sec. 7 also pose interesting theoretical and practical open problems. The search performance of varied-insertion coalesced hashing is slightly better than Algorithm C; research is currently underway to analyze its performance and to determine the optimum setting β_OPT. One exciting aspect of coalesced hashing is that it is an extremely good technique which very likely can be made even more applicable when these open questions are solved.

Appendix

For purposes of average-case analysis, we assume that an unsuccessful search can begin at any of the M address region slots with equal probability. This includes the special case of insertion. Similarly, each record in the table has the same chance of being the object of any given successful search. In other words, all searches and insertions involve random keys. This is sometimes called the Bernoulli probability model.

The asymptotic formulas in this section apply to a random M′-slot coalesced hash table with address region size M = ⌈βM′⌉ and with N = ⌈αM′⌉ inserted records, where the load factor α and the address factor β are constants in the ranges 0 ≤ α ≤ 1 and 0 < β ≤ 1. Formal derivations are given in [10, 11, 13].

Number of Probes Per Search

The expected number of probes in unsuccessful and successful searches, respectively, as M′ → ∞ is

\[
C'_N(M', M) \sim
\begin{cases}
e^{-\alpha/\beta} + \dfrac{\alpha}{\beta}, & \text{if } \alpha \le \lambda\beta; \\[2ex]
\dfrac{1}{\beta} + \dfrac{1}{4}\left(e^{2(\alpha/\beta - \lambda)} - 1\right)\left(3 - \dfrac{2}{\beta} + 2\lambda\right) - \dfrac{1}{2}\left(\dfrac{\alpha}{\beta} - \lambda\right), & \text{if } \alpha \ge \lambda\beta;
\end{cases}
\tag{A1}
\]

\[
C_N(M', M) \sim
\begin{cases}
1 + \dfrac{\alpha}{2\beta}, & \text{if } \alpha \le \lambda\beta; \\[2ex]
1 + \dfrac{\beta}{8\alpha}\left(e^{2(\alpha/\beta - \lambda)} - 1 - 2\left(\dfrac{\alpha}{\beta} - \lambda\right)\right)\left(3 - \dfrac{2}{\beta} + 2\lambda\right) + \dfrac{1}{4}\left(\dfrac{\alpha}{\beta} + \lambda\right) + \dfrac{\lambda}{4}\left(1 - \dfrac{\lambda\beta}{\alpha}\right), & \text{if } \alpha \ge \lambda\beta;
\end{cases}
\]

where λ is the unique nonnegative solution to the equation