chaining, linear probing, and double hashing. The last three sections deal with variations and different implementations for coalesced hashing, including deletion algorithms, alternative insertion methods, and external searching on secondary storage devices.

This paper is designed to provide a comprehensive treatment of the many practical issues concerned with the implementation of the coalesced hashing method. Readers interested in the theoretical justification of the results in this paper can consult [10, 11, 13, 14, 1].

2. The Coalesced Hashing Algorithm

The algorithm works like this: Given a record with key K, the algorithm searches for it in the hash table, starting at location hash(K) and following the links in the chain. If the record is present in the table, then it is found and the search is successful; otherwise, the end of the chain is reached and the search is unsuccessful. For simplicity, we assume that the record is inserted whenever the search ends unsuccessfully, according to the following rule: If position hash(K) is empty, then the record is stored at that location; else, it is placed in the largest-numbered empty slot in the table and is linked to the end of the chain. This has the effect of putting the first M' - M colliders into the cellar.

Coalesced hashing is a generalization of the well-known separate (or direct) chaining method. The separate chaining method halts with overflow when there is no more room in the cellar to store a collider. The example in Fig. 1(a) can be considered to be an example of both coalesced hashing and separate chaining, because the cellar is large enough to store the three colliders.

Figures 1(b) and 1(c) show how the two methods differ. The cellar contains only one slot in the example in Fig. 1(b). When the key MARK collides with DONNA at slot 4, the cellar is already full. Separate chaining would report overflow at this point. The coalesced hashing method, however, stores the key MARK in the largest-numbered empty space (which is location 10 in the address region). This causes a later collision when DAVE hashes to position 10, so DAVE is placed in slot 8 at the end of the chain containing DONNA and MARK. The method derives its name from this "coalescing" of records with different hash addresses into single chains.

The average number of probes per search shows marked improvement in Fig. 1(b), even though coalescing has occurred. Intuitively, the larger address region spreads out the records more evenly and causes fewer collisions, i.e., the hash function can be thought of as "shooting" at a bigger target. The cellar is now too small to store these fewer colliders, so it overflows. Fortunately, this overflow occurs late in the game, and the pileup phenomenon of coalescing is not significant enough to counteract the benefits of a larger address region. However, in the extreme case when M = M' = 11 and there is no cellar (which we call standard coalesced hashing), coalescing begins too early and search time worsens (as typified by Fig. 1(c)). Determining the optimum address factor β = M/M' is a major focus of this paper.

The first order of business before we can start a detailed study of the coalesced hashing method is to formalize the algorithm and to define reasonable measures of search performance. Let us assume that each
Fig. 1. Coalesced hashing, M' = 11, N = 8. The sizes of the address region are (a) M = 8, (b) M = 10, and (c) M = 11. [Diagrams of the three resulting hash tables omitted.]

    Keys:                A.L.  AUDREY  AL  TOOTIE  DONNA  MARK  JEFF  DAVE
    Hash addresses: (a)     5       2   2       7      4     5     1     2
                    (b)     1       6   9       1      4     4     3    10
                    (c)    11       3   5       3     10     4    10     9

Average number of probes per successful search: (a) 12/8 = 1.5; (b) 11/8 = 1.375; (c) 14/8 = 1.75.
Communications of the ACM, December 1982, Volume 25, Number 12
of the M' contiguous slots in the coalesced hash table has the following organization:

    | EMPTY | KEY | other fields | LINK |

For each value of i between 1 and M', EMPTY[i] is a one-bit field that denotes whether the ith slot is unused, KEY[i] stores the key (if any), and LINK[i] is either the index to the next spot in the chain or else the null value 0.

The algorithms in this article are written in the English-like style used by Knuth in order to make them readily understandable to all and to facilitate comparisons with the algorithms contained in [7, 4, 12]. Block-structured languages, like PL/I and Pascal, are good for expressing complicated program modules; however, they are not used here, because hashing algorithms are so short that there is no reason to discriminate against those who are not comfortable with such languages.

Algorithm C (Coalesced hashing search and insertion). This algorithm searches an M'-slot hash table, looking for a given key K. If the search is unsuccessful and the table is not full, then K is inserted.

The size of the address region is M; the hash function hash returns a value between 1 and M (inclusive). For convenience, we make use of slot 0, which is always empty. The global variable R is used to find an empty space whenever a collision must be stored in the table. Initially, the table is empty, and we have R = M' + 1; when an empty space is requested, R is decremented until one is found. We assume that the following initializations have been made before any searches or insertions are performed: M ← ⌈βM'⌉, for some constant 0 < β ≤ 1; EMPTY[i] ← true, for all 0 ≤ i ≤ M'; and R ← M' + 1.

C1. [Hash.] Set i ← hash(K). (Now 1 ≤ i ≤ M.)
C2. [Is there a chain?] If EMPTY[i], then go to step C6. (Otherwise, the ith slot is occupied, so we will look at the chain of records that starts there.)
C3. [Compare.] If K = KEY[i], the algorithm terminates successfully.
C4. [Advance to next record.] If LINK[i] ≠ 0, then set i ← LINK[i] and go back to step C3.
C5. [Find empty slot.] (The search for K in the chain was unsuccessful, so we will try to find an empty table slot to store K.) Decrease R one or more times until EMPTY[R] becomes true. If R = 0, then there are no more empty slots, and the algorithm terminates with overflow. Otherwise, append the Rth cell to the chain by setting LINK[i] ← R; then set i ← R.
C6. [Insert new record.] Set EMPTY[i] ← false, KEY[i] ← K, LINK[i] ← 0, and initialize the other fields in the record. ∎

In this paper, we concern ourselves with measuring the searching phase of Algorithm C and ignore for the most part the insertion time in steps C5 and C6. (The time for step C5 is not significant, because the total number of times R is decremented over the course of all the insertions cannot be more than the number of inserted records; hence, the amortized expected number of decrements is at most 1. The decrementing operation can also be done in parallel with steps C1-C4.) Our primary measure of search performance is the number of probes per search, which is the number of different table slots that are accessed while searching. In Algorithm C, this quantity is equal to

    max{1, number of times step C3 is performed}

For example, in Fig. 1(b), the unsuccessful searches for keys A.L. and TOOTIE (immediately prior to their insertions) each took one probe, while a successful search for DAVE would take two probes.

The average performance of the algorithm is obtained by assuming that all searches and insertions are random. The Appendix contains a discussion of the probability model as well as the formulas for the expected number of probes in unsuccessful and successful searches.

3. Assembly Language Implementation

Even though probe-counting gives us a good idea of search performance, other factors (such as the complexity of the search loop and the overhead in computing the hash address) also affect the running time when Algorithm C is programmed for a real computer. For completeness, we optimize the running time of assembly language versions of coalesced hashing.

We choose to program in assembly language rather than in some high-level language like Fortran, PL/I, or Pascal, in order to achieve maximum possible efficiency. Top efficiency is important in large-scale applications of hashing, but it can also be achieved in smaller systems with little extra effort, because hashing algorithms are so short that implementing them (even in assembly language) is easy. We use a hypothetical language based on Knuth's MIX [6] because its features are similar to most well-known machines and its inherent simplicity allows us to write programs in clear and concise form.

Program C below is a MIX-like implementation of Algorithm C. Liberties have been taken with the language for purposes of clarity; the actual MIX code appears in [10]. The program is written in a five-column format: the first column gives the line numbers, the second column lists the instruction labels, the third column contains the assembly language instructions, the fourth column counts the number of times the instructions are executed, and the last column is for comments that explain what the instructions do. The syntax of the commands should be clear to those familiar with assembly language programming. The four memory registers
used in Program C are named rA, rX, rI, and rJ. The reference KEY(I) denotes the contents of the memory location whose address is the value of KEY plus the contents of rI. (This is KEY[i] in the notation of Algorithm C.)

Program C (Coalesced hashing search and insertion). This program follows the conventions of Algorithm C, except that the EMPTY field is implicit in the LINK field: empty slots are marked by a -1 in the LINK field of that slot. Null links are denoted by a 0 in the LINK field. The variable R and the key K are stored in memory locations R and K. Registers rI and rA are used to store the values of i and K. Register rJ stores either the value of LINK[i] or R. The instruction labels SUCCESS and OVERFLOW are for exiting and are assumed to lie somewhere outside this code.
01  START  LD X, K          1            Step C1. Load rX with K.
02         ENT A, 0         1            Enter 0 into rA.
03         DIV =M=          1            rA ← ⌊K/M⌋, rX ← K mod M.
04         ENT I, X         1            Enter rX into rI.
05         INC I, 1         1            Increment rI by 1.
06         LD A, K          1            Load rA with K.
07         LD J, LINK(I)    1            Step C2. Load rJ with LINK[i].
08         JN J, STEP6      1            Jump to STEP6 if LINK[i] < 0.
09         CMP A, KEY(I)    A            Step C3. Compare K with KEY[i].
10         JE SUCCESS       A            Exit (successfully) if K = KEY[i].
11         JZ J, STEP5      A - S1       Jump to STEP5 if LINK[i] = 0.
12  STEP4  ENT I, J         C - 1        Step C4. Enter rJ into rI.
13         CMP A, KEY(I)    C - 1        Step C3. Compare K with KEY[i].
14         JE SUCCESS       C - 1        Exit (successfully) if K = KEY[i].
15         LD J, LINK(I)    C - 1 - S2   Load rJ with LINK[i].
16         JNZ J, STEP4     C - 1 - S2   Jump to STEP4 if LINK[i] ≠ 0.
17  STEP5  LD J, R          A - S        Step C5. Load rJ with R.
18         DEC J, 1         T            Decrement R by 1.
19         LD X, LINK(J)    T            Load rX with LINK[R].
20         JNN X, *-2       T            Go back two steps if LINK[R] ≥ 0.
21         JZ J, OVERFLOW   A - S        Exit (with overflow) if R = 0.
22         ST J, LINK(I)    A - S        Store R in LINK[i].
23         ENT I, J         A - S        Enter rJ into rI.
24         ST J, R          A - S        Update R in memory.
25  STEP6  ST 0, LINK(I)    1 - S        Step C6. Store 0 in LINK[i].
26         ST A, KEY(I)     1 - S        Store K in KEY[i]. ∎
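To make the control flow of Algorithm C concrete, here is a short Python sketch of steps C1-C6 (our own illustration; the class and method names are invented, and records are reduced to bare keys). Replaying the Fig. 1(b) example against it reproduces the layout and probe counts given there.

```python
class CoalescedHash:
    """Python sketch of Algorithm C (illustrative; not from the paper).
    Slots are numbered 1..m_prime; slot 0 is kept permanently empty.
    The address region is slots 1..m; the cellar is slots m+1..m_prime."""

    def __init__(self, m, m_prime, hash_fn):
        self.m = m                          # address-region size M
        self.m_prime = m_prime              # total table size M'
        self.hash_fn = hash_fn              # must return a value in 1..M
        self.key = [None] * (m_prime + 1)   # None plays the role of EMPTY
        self.link = [0] * (m_prime + 1)     # 0 is the null link
        self.r = m_prime + 1                # step C5 searches below R

    def search_insert(self, k):
        """Search for key k; insert it if absent.  Returns (found, probes),
        where probes = max{1, number of times step C3 is performed}."""
        i = self.hash_fn(k)                 # step C1
        probes = 1
        if self.key[i] is not None:         # step C2: a chain starts here
            while True:
                if self.key[i] == k:        # step C3
                    return True, probes
                if self.link[i] == 0:       # step C4: end of chain reached
                    break
                i = self.link[i]
                probes += 1
            self.r -= 1                     # step C5: find the largest-
            while self.r > 0 and self.key[self.r] is not None:
                self.r -= 1                 #   numbered empty slot
            if self.r == 0:
                raise OverflowError("table is full")
            self.link[i] = self.r           # append slot R to the chain
            i = self.r
        self.key[i] = k                     # step C6: insert new record
        self.link[i] = 0
        return False, probes
```

Driving it with the eight keys of Fig. 1(b) and their listed hash addresses puts MARK in slot 10, DAVE in slot 8, and TOOTIE in the one-slot cellar (slot 11), and the successful-search probe counts sum to 11, matching the 11/8 = 1.375 average reported in Fig. 1.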
The execution time is measured in MIX units of time, which we denote u. The number of time units required by an instruction is equal to the number of memory references (including the reference to the instruction itself). Hence, the LD, ST, and CMP instructions each take two units of time, while ENT, INC, DEC, and the jump instructions require only one time unit. The division operation used to compute the hash address is an exception to this rule; it takes 14u to execute.

The running time of a MIX program is the weighted sum, over all instructions in the program, of

    (number of times the instruction is executed) × (number of time units required by the instruction)    (1)

This is a somewhat simplistic model, since it does not make use of cache or buffered memory for fast access of frequently used data, and since it ignores any intervention by the operating system. But it places all hashing algorithms on an equal footing and gives a good indication of relative merit.

The fourth column of Program C expresses the number of times each instruction is executed in terms of the quantities

    C = number of probes per search.
    A = 1 if the initial probe found an occupied slot, 0 otherwise.
    S = 1 if successful, 0 if unsuccessful.
    T = number of slots probed while looking for an empty space.

We further decompose S into S1 + S2, where S1 = 1 if the search is successful on the first probe, and S1 = 0 otherwise. By formula (1), the total running time of the searching phase is

    (7C + 4A + 17 - 3S + 2S1)u    (2)

and the insertion of a new record after an unsuccessful search (when S = 0) takes an additional (8A + 4T + 4)u. The average running time is the expected value of (2), assuming that all insertions and searches are random. The formula can be obtained by replacing the variables in Eq. (2) with their expected values.
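The cost accounting above is compact enough to state as code. The following Python helpers (our own restatement, with argument names matching the quantities C, A, S, S1, and T) evaluate the searching-phase cost of formula (2) and the insertion surcharge, which follows from the unit costs of lines 17-26 of Program C: 8u of bookkeeping when the home slot was occupied (A = 1), 4u for each slot inspected while hunting for an empty cell, and 4u to store the new record.

```python
def search_time(C, A, S, S1):
    """MIX time units for the searching phase of Program C, formula (2)."""
    return 7 * C + 4 * A + 17 - 3 * S + 2 * S1

def insertion_extra(A, T):
    """Additional MIX time units to insert a new record after an
    unsuccessful search (S = 0): lines 17 and 21-24 cost 8u when A = 1,
    lines 18-20 cost 4u per slot probed, and lines 25-26 cost 4u."""
    return 8 * A + 4 * T + 4
```

For instance, the successful two-probe search for DAVE in Fig. 1(b) has C = 2, A = 1, S = 1, S1 = 0, for a total of 32u.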
4. Tuning β to Obtain Optimum Performance

The purpose of the analysis in [10, 11, 13] is to show how the average-case performance of the coalesced hashing method varies as a function of the address factor β = M/M' and the load factor α = N/M'. In this section, for each fixed value of α, we make use of those results in order to "tune" our choice of β and speed up the search times. Our two measures of performance are the expected number of probes per search and the average running time of assembly language versions. In the latter case, we study a MIX implementation in detail, and then show how to apply what we learn to other assembly languages.

Unfortunately, there is no single choice of β that yields best results: the optimum choice βOPT is a function of the load factor α, and it is even different for unsuccessful and successful searches. The section concludes with practical tips on how to initialize β. In particular, we shall see that the choice β = 0.86 works well in most situations.

4.1 Number of Probes Per Search

For each fixed value of α, we want to find the values βOPT that minimize the expected number of search probes in unsuccessful and successful searches. Formulas (A1) and (A2) in the Appendix express the average number of probes per search as a function of three variables: the load factor α = N/M', the address factor β = M/M', and a new variable λ = L/M, where L is the expected number of inserted records needed to make the cellar become full. The variables β and λ are related by the formula

    e^(-λ) + λ = 1/β    (3)

Formulas (A1) and (A2) each have two cases, "α ≤ λβ" and "α ≥ λβ," which have the following intuitive meanings: The condition α ≤ λβ means that with high probability not enough records have been inserted to fill up the cellar, while the condition α ≥ λβ means that enough records have been inserted to make the cellar almost surely full.

The optimum address factor βOPT is always located somewhere in the "α ≥ λβ" region, as shown in the Appendix. The rest of the optimization procedure is a straightforward application of differential calculus. First, we substitute Eq. (3) into the "α ≥ λβ" cases of the formulas for the expected number of probes per search in order to express them in terms of only the two variables α and λ. For each nonzero fixed value of α, the formulas are convex w.r.t. λ and have unique minima. We minimize them by setting their derivatives equal to 0. Numerical analysis techniques are used to solve the resulting equations and to get the optimum values of λ for several different values of α. Then we reapply Eq. (3) to express the optimum points in terms of β. The results are graphed in Fig. 2(a), using spline interpolation to fill in the gaps.

4.2 MIX Running Times

Optimizing the MIX execution times could be tricky, in general, because the formulas might have local as well as global minima. Then when we set the derivatives equal to 0 in order to find βOPT, there might be several roots to the resulting equations. The crucial fact that lets us apply the same optimization techniques we used above for the number of probes is that the formulas for the MIX running times are well-behaved, as shown in the Appendix. By that we mean that each formula is minimized at a unique βOPT, which occurs either at the endpoint α = λβ or at the unique point in the "α > λβ" region where the derivative w.r.t. β is 0.

The optimization procedure is the same as before. The expected values of formulas (A4) and (A5), which give the MIX running times for unsuccessful and successful searches, are functions of the three variables α, β, and λ. We substitute Eq. (3) into the expected running times in order to express β in terms of λ. For several different load factors α and for each type of search, we find the value of λ that minimizes the formula, and then we retranslate this value via Eq. (3) to get βOPT. Figure 2(b) graphs these optimum values βOPT as a function of α; spline interpolation was used to fill in the gaps. As in the previous section, the formulas for the average unsuccessful and successful search times yield different optimum address factors. For the successful search case, notice how closely βOPT agrees with the corresponding values that minimize the expected number of probes.

Fig. 2. The values βOPT that optimize search performance for the following three measures: (a) the expected number of probes per search, (b) the expected running time of Program C, and (c) the expected assembly language running time for large keys. [Graphs of βOPT against load factor α, with separate curves for unsuccessful and successful searches, omitted.]
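Equation (3) has no closed form for λ, but its left-hand side is strictly increasing for λ > 0, so λ is easy to recover from β numerically. A small Python sketch of ours (`lam_from_beta` is not a routine from the paper):

```python
import math

def lam_from_beta(beta, tol=1e-12):
    """Solve exp(-lam) + lam = 1/beta, Eq. (3), for lam >= 0 by bisection.
    f(lam) = exp(-lam) + lam is increasing for lam > 0, f(0) = 1 <= 1/beta
    when 0 < beta <= 1, and f(1/beta) > 1/beta, so the bracket
    [0, 1/beta] contains the root."""
    lo, hi = 0.0, 1.0 / beta
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if math.exp(-mid) + mid < 1.0 / beta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For the compromise choice β = 0.86 this yields λ ≈ 0.63, and for β = 1 it returns λ = 0, consistent with standard coalesced hashing having no cellar.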
4.3 Applying the Results to Other Implementations

Our MIX analysis suggests two important principles to be used in finding βOPT for a particular implementation of coalesced hashing. First, the formulas for the expected number of times each instruction in the program is executed (which are expressed for Program C in terms of C, A, S, S1, S2, and T) may have the two cases, "α ≤ λβ" and "α ≥ λβ," but probably not more.

Second, the same optimization process as above can be used to find βOPT, because the formulas for the running times should be well-behaved for the following reason: The main difference between Program C and another implementation is likely to be the relative time it takes to process each key. (The keys are assumed to be very small in the MIX version.) Thus, the unsuccessful search time for another implementation might be approximately

    [(2x + 5)C + (2x + 2)A + (-2x + 19)]u'    (4)

where u' is the standard unit of time on the other computer and x is how many times longer it takes to process a key (multiplied by u/u'). Successful search times would be about

    [(2x + 5)C + 18 + 2S1]u'    (5)

Formulas (4) and (5) were calculated by increasing the execution times of the key-processing steps 9 and 13 in Program C by a factor of x. (See formulas (A4) and (A5) for the x = 1 case.) We ignore the extra time it takes to load the larger key and to compute the hash function, since that does not affect the optimization.

The role of C in formula (4) is less prevalent than in (A4) as x gets large: the ratio of the coefficients of C and A decreases from 7/4 in (A4) and approaches the limit 2/2 = 1 in formula (4). Even in this extreme case, however, computer calculations show that the formula for the average running time is well-behaved. The values of βOPT that minimize formula (4) when x is large are graphed in Fig. 2(c).

For successful searches, however, the value of C more strongly dominates the running times for larger values of x, so the limiting values of βOPT in Fig. 2(c) coincide with the ones that minimize the expected number of probes per search in Fig. 2(a). Figure 2(b) shows that the approximation is close even for the case x = 1, which is Program C.

4.4 How to Choose β

It is important to remember that the address region size M = ⌈βM'⌉ must be initialized when the hash table is empty and cannot change thereafter. Unfortunately, the last two sections show that each different load factor α requires a different optimum address factor βOPT; in fact, the values of βOPT differ for unsuccessful and successful searches. This means that optimizing the average unsuccessful (or successful) search time for a certain load factor α will lead to suboptimum performance when the load factor is not equal to α.

One strategy is to pick β = 0.782, which minimizes the expected number of probes per unsuccessful search as well as the average MIX unsuccessful search time when the table is full (i.e., load factor α = 1), as indicated in Fig. 2. This choice of β yields the best absolute bound on search performance, because when the table is full, search times are greatest and unsuccessful searches average slightly longer than successful ones. Regardless of the load factor, the expected number of probes per search would be at most 1.79, and the average MIX searching time would be bounded by 33.52u.

Another strategy is to pick some compromise address factor that leads to good overall performance for a large range of load factors. A reasonable choice is β = 0.86; then the unsuccessful searches are optimized (over all other values of β) when the load factor is ≈0.68 (number of probes) and ≈0.56 (MIX), and the successful search performance is optimized at load factors ≈0.94 (number of probes) and ≈0.95 (MIX).

Figures 3 through 6 graph the expected search performance of coalesced hashing as a function of α for both types of searches (unsuccessful and successful) and for both measures of performance (number of probes and MIX running time). The C1 curve corresponds to standard coalesced hashing (i.e., β = 1); the C0.86 line is our compromise choice β = 0.86; and the dashed line COPT represents the best possible search performance that could be achieved by tuning (in which β is optimized for each load factor).

Notice that the value β = 0.86 yields near-optimum search times once the table gets half-full, so this compromise offers a viable strategy. Of course, if some prior knowledge about the types and frequencies of the searches were available, we could tailor our choice of β to meet those specific needs.

5. Comparisons

In this section, we compare the searching times of the coalesced hashing method with those from a representative collection of hashing schemes: standard coalesced hashing (C1), separate chaining (S), separate chaining with ordered chains (SO), linear probing (L), and double hashing (D). Implementations of the methods are given in [10].

These methods were chosen because they are the most well-known and since they each have implementations similar to that of Algorithm C. Our comparisons are based both on the expected number of probes per search as well as on the average MIX running time.

Coalesced hashing performs better than the other methods. The differences are not so dramatic with the MIX search times as with the number of probes per search, due to the large overhead in computing the hash address. However, if the keys were larger and comparisons took longer, the relative MIX savings would closely approximate the savings in number of probes.
Fig. 3. The average number of probes per unsuccessful search, as M and M' → ∞, for coalesced hashing (C1, C0.86, COPT for β = 1, 0.86, βOPT), separate chaining (S), separate chaining with ordered chains (SO), linear probing (L), and double hashing (D). [Graph omitted.]

Fig. 4. The average number of probes per successful search, as M and M' → ∞, for coalesced hashing (C1, C0.86, COPT for β = 1, 0.86, βOPT), separate chaining (S), separate chaining with ordered chains (SO), linear probing (L), and double hashing (D). [Graph omitted.]

5.1 Standard Coalesced Hashing (C1)

Standard coalesced hashing is the special case of coalesced hashing for which β = 1 and there is no cellar. This is obviously the most realistic comparison that can be made, because except for the initialization of the address region size, standard coalesced hashing and "tuned" coalesced hashing are identical. Figures 3 and 4 show that the savings in number of probes per search can be as much as 14 percent (unsuccessful) and 6 percent (successful). In Figs. 5 and 6, the corresponding savings in MIX searching time is 6 percent (unsuccessful) and 2 percent (successful).

Fig. 5. The average MIX execution time per unsuccessful search, as M' → ∞, for coalesced hashing (C1, C0.86, COPT for β = 1, 0.86, βOPT), separate chaining (S), separate chaining with ordered chains (SO), linear probing (L), and double hashing (D). [Graph omitted.]

Fig. 6. The average MIX execution time per successful search, as M' → ∞, for coalesced hashing (C1, C0.86, COPT for β = 1, 0.86, βOPT), separate chaining (S), separate chaining with ordered chains (SO), linear probing (L), and double hashing (D). [Graph omitted.]
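The gap between the C1 and C0.86 curves can also be observed empirically. The sketch below (our own experiment with invented parameters: M' = 1000 slots, 950 uniformly random insertions, a fixed seed) builds a coalesced hash table with integer keys and measures the average number of probes per successful search:

```python
import random

def avg_successful_probes(beta, m_prime=1000, n=950, seed=42):
    """Empirical average probes per successful search for coalesced
    hashing with address factor beta (illustrative experiment only)."""
    m = int(beta * m_prime)                 # address-region size M
    key = [None] * (m_prime + 1)            # slots 1..m_prime; 0 unused
    link = [0] * (m_prime + 1)
    r = m_prime + 1
    rng = random.Random(seed)
    addrs = []                              # hash address of record j
    for j in range(n):
        i = rng.randint(1, m)
        addrs.append(i)
        if key[i] is not None:              # collision: walk the chain,
            while link[i] != 0:
                i = link[i]
            r -= 1                          # then take the largest-
            while key[r] is not None:       # numbered empty slot
                r -= 1
            link[i] = r
            i = r
        key[i] = j
    total = 0
    for j in range(n):                      # now look every record up
        i, probes = addrs[j], 1
        while key[i] != j:
            i = link[i]
            probes += 1
        total += probes
    return total / n
```

At this load factor (α = 0.95), the β = 0.86 table should average fewer probes per successful search than the cellarless β = 1 table, in line with the C0.86 vs. C1 curves in Fig. 4.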
5.2 Separate (or Direct) Chaining (S)

The separate chaining method is given an unfair advantage in Figs. 3 and 4: the number of probes per search is graphed as a function of δ = N/M rather than α = N/M' and does not take into account the number of auxiliary slots used to store colliders. In order to make the comparison fair, we must adjust the load factor accordingly.

Separate chaining implementations are often designed to accommodate about N = M records; an average of M(1 - 1/M)^M ≈ M/e auxiliary slots are needed to store the colliders. The total table size is thus M' = M + M/e. Solving backwards for M, we get M ≈ 0.731M'. In other words, we may consider separate chaining to be the special case of coalesced hashing for which β ≈ 0.731, except that no more records can be inserted once the cellar overflows. Hence, the adjusted load factor is α = 0.731δ, and overflow occurs when there are around N = M = 0.731M' inserted records. (This is a reasonable space/time compromise: if we make M smaller, then more records can usually be stored before overflow occurs, but the average search times blow up; if we increase M to get better search times, then overflow occurs much sooner, and many slots are wasted.)

If we adjust the load factors in Figs. 3 and 4 in this way, Algorithm C generates better search statistics: the expected number of probes per search for separate chaining is ≈1.37 (unsuccessful) and ≈1.5 (successful) when the load factor δ is 1, while that for coalesced hashing is ≈1.32 (unsuccessful) and ≈1.44 (successful) when the load factor α = βδ is equal to 0.731.

The graphs in Figs. 5 and 6 already reflect this load factor adjustment. In fact, the MIX implementation of separate chaining (Program S in [10]) is identical to Program C, except that β is initialized to 0.731 and overflow is signaled automatically when the cellar runs out of empty slots. Program C is slightly quicker in MIX execution time than Program S, but more importantly, the coalesced hashing implementation is more space efficient: Program S usually overflows when α ≈ 0.731, while Program C can always obtain full storage utilization α = 1. This confirms our intuition that coalesced hashing can accommodate more records than the separate chaining method and still outperform separate chaining before that method overflows.

5.3 Separate Chaining with Ordered Chains (SO)

This method is a variation of separate chaining in which the chains are kept ordered by key value. The expected number of probes per successful search does not change, but unsuccessful searches are slightly quicker, because only about half the chain needs to be searched, on the average.

Our remarks about adjusting the load factor in Figs. 3 and 4 also apply to method SO. But even after that is done, the average number of probes per unsuccessful search as well as the expected MIX unsuccessful search time is slightly better for this method than for coalesced hashing. However, as Fig. 6 illustrates, the average successful search time of Program SO is worse than Program C's, and in real-life situations, the difference is likely to be more apparent, because records that are inserted first tend to be looked up more often and should be kept near the beginning of the chain, not rearranged.

Method SO has the same storage limitations as the separate chaining scheme (i.e., the table usually overflows when N = M = 0.731M'), whereas coalesced hashing can obtain full storage utilization.

5.4 Linear Probing (L) and Double Hashing (D)

When searching for a record with key K, the linear probing method first checks location hash(K), and if another record is already there, it steps cyclically through the table, starting at location hash(K), until the record is found (successful search) or an empty slot is reached (unsuccessful search). Insertions are done by placing the record into the empty slot that terminated the unsuccessful search. Double hashing generalizes this by letting the cyclic step size be a function of K.

We have to adjust the load factor in the opposite direction when we compare Algorithm C with methods L and D, because the latter do not require LINK fields. For example, if we suppose that the LINK field comprises 1/4 of the total record size in a coalesced hashing implementation, then the search statistics in Figs. 3 and 4 for Algorithm C with load factor α should be compared against those for linear probing and double hashing with load factor (3/4)α. In this case, the average number of probes per search is still better for coalesced hashing. However, the LINK field is often much smaller than the rest of the record, and sometimes it can be included in the table at virtually no extra cost. The MIX implementation Program C in [10] assumes that the LINK field can be squeezed into the record without need of extra storage space. Figures 5 and 6, therefore, require no load factor adjustment.

To balance matters, the MIX implementations of linear probing and double hashing, which are given in [10] and [7], contain two code optimizations. First, since LINK fields are not used in methods L and D, we no longer need 0 to denote a null LINK, and we can renumber the table slots from 0 to M' - 1; the hash function now returns a value between 0 and M' - 1. This makes the hash address computation faster by 1u, because the instruction INC I, 1 can be eliminated. Second, the empty slots are denoted by the value 0 in order to make the comparisons in the inner loop as fast as possible. This means that records are not allowed to have a key value of 0. The final results are graphed in Figs. 5 and 6. Coalesced hashing clearly dominates when the load factor is greater than 0.6.

6. Deletions

It is often useful in hashing applications to be able to delete records when they no longer logically belong to the set of objects being represented in the hash table. For
example, in an airlines reservations system, passenger records are often expunged soon after the flight has taken place.
One possible deletion strategy often used for linear probing and double hashing is to include a special one-bit DELETED field in each record that says whether or not the record has been deleted. The search algorithm must be modified to treat each "deleted" table slot as if it were occupied by a null record, even though the entire record is still there. This is especially desirable when there are pointers to the records from outside the table.
If there are no such external pointers to worry about, the "deleted" table slots can be reused for later insertions: Whenever an empty slot is needed in step C5 of Algorithm C, the record is inserted into the first "deleted" slot encountered during the unsuccessful search; if there is no such slot, an empty slot is allocated in the usual way. However, a certain percentage of the "deleted" slots probably will remain unused, thus preventing full storage utilization. Also, insertions and deletions over a prolonged period would cause the expected search times to approximate those for a full table, regardless of the number of undeleted records, because the "deleted" records make the searches longer.
If we are willing to spend a little extra time per deletion, we can do without the DELETED field by relocating some of the records that follow in the chain. The basic idea is this: First, we find the record we want to delete, mark its table slot empty, and set the LINK field of its predecessor (if any) to the null value 0. Then we use Algorithm C to reinsert each record in the remainder of the chain, but whenever an empty slot is needed in step C5, we use the position that the record already occupies.
This method can be illustrated by deleting AL from location 10 in Fig. 7(a); the end result is pictured in Fig. 7(b). The first step is to create a hole in position 10 where AL was, and to set AUDREY's LINK field to 0. Then we process the remainder of the chain. The next record TOOTIE rehashes to the hole in location 10, so TOOTIE moves up to plug the hole, leaving a new hole in position 9. Next, DONNA collides with AUDREY during rehashing, so DONNA remains in slot 8 and is linked to AUDREY. Then MARK also collides with AUDREY; we leave MARK in position 7 and link it to DONNA, which was formerly at the end of AUDREY's hash chain. The record JEFF rehashes to the hole in slot 9, so we move it up to plug the hole, and a new hole appears in position 6. Finally, DAVE rehashes to position 9 and joins JEFF's chain.

Fig. 7. (a) Inserting the eight records; (b) Inserting all the records except AL.

    slot   (a)       (b)
      1    AUDREY    AUDREY
      2    --        --
      3    --        --
      4    --        --
      5    DAVE      DAVE
      6    JEFF      --
      7    MARK      MARK
      8    DONNA     DONNA
      9    TOOTIE    JEFF
     10    AL        TOOTIE
     11    A.L.      A.L.

    Keys:            A.L.  AUDREY  AL  TOOTIE  DONNA  MARK  JEFF  DAVE
    Hash addresses:   11     1      1    10      1      1     9     9

Location 6 is the current hole position when the deletion algorithm terminates, so we set EMPTY[6] ← true and return it to the pool of empty slots. However, the value of R in Algorithm C is already 5, so step C5 will never try to reuse location 6 when an empty slot is needed.
We can solve this problem by using an available-space list in step C5 rather than the variable R; the list must be doubly linked so that a slot can be removed quickly from the list in step C6. The available-space list does not require any extra space per table slot, since we can use the KEY and LINK fields of the empty slots for the two pointer fields. (The KEY field is much larger than the LINK field in typical implementations.) For clarity, we rename the two pointer fields NEXT and PREV. Slot 0 in the table acts as the dummy start of the available-space list, so NEXT[0] points to the first actual slot in the list and PREV[0] points to the last. Before any records are inserted into the table, the following extra initializations must be made: NEXT[0] ← M'; PREV[M'] ← 0; and NEXT[i] ← i − 1 and PREV[i − 1] ← i, for 1 ≤ i ≤ M'. We replace steps C5 and C6 by
C5. [Find empty slot.] (The search for K in the chain was unsuccessful, so we will try to find an empty table slot to store K.) If the table is already full (i.e., NEXT[0] = 0), the algorithm terminates with overflow. Otherwise, set LINK[i] ← NEXT[0] and i ← NEXT[0].
C6. [Insert new record.] Remove the ith slot from the
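The list initialization and the two splice operations above can be written out directly. The following Python sketch (the function names are ours, not the paper's) uses NEXT/PREV arrays over slots 0..M' with slot 0 as the dummy head, exactly as just described:

```python
def init_free_list(mp):
    """Build the initial list 0 -> mp -> mp-1 -> ... -> 1 -> 0."""
    nxt, prev = [0] * (mp + 1), [0] * (mp + 1)
    nxt[0] = mp                     # NEXT[0] <- M'
    prev[mp] = 0                    # PREV[M'] <- 0
    for i in range(1, mp + 1):      # NEXT[i] <- i-1 and PREV[i-1] <- i
        nxt[i] = i - 1
        prev[i - 1] = i
    return nxt, prev

def allocate(nxt, prev):
    """Step C5's allocation: take the first listed slot; 0 signals overflow."""
    i = nxt[0]
    if i:
        unlink(nxt, prev, i)
    return i

def unlink(nxt, prev, i):
    """Step C6's removal of slot i from the list."""
    prev[nxt[i]] = prev[i]
    nxt[prev[i]] = nxt[i]

def free(nxt, prev, hole):
    """Step CD5's return of a freed slot to the front of the list."""
    nxt[hole], prev[hole] = nxt[0], 0
    prev[nxt[0]] = hole
    nxt[0] = hole
```

Because the list starts as M', M' − 1, ..., 1, allocation hands out the largest-numbered empty slots first, mimicking the variable R, while still allowing freed slots to be reused.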
available-space list by setting PREV[NEXT[i]] ← PREV[i] and NEXT[PREV[i]] ← NEXT[i]. Then set EMPTY[i] ← false, KEY[i] ← K, LINK[i] ← 0, and initialize the other fields in the record.
The following deletion algorithm is analyzed in detail in [10] and [14].
Algorithm CD (Deletion with coalesced hashing). This algorithm deletes the record with key K from a coalesced hash table constructed by Algorithm C, with steps C5 and C6 modified as above.
This algorithm preserves the important invariant that K is stored at its hash address if and only if it is at the start of its chain. This makes searching for K's predecessor in the chain easy: if it exists, then it must come at or after position hash(K) in the chain.
CD1. [Search for K.] Set i ← hash(K). If EMPTY[i], then K is not present in the table and the algorithm terminates. Otherwise, if K = KEY[i], then K is at the start of the chain, so go to step CD3.
CD2. [Split chain in two.] (K is not at the start of its chain.) Repeatedly set PRED ← i and i ← LINK[i] until either i = 0 or K = KEY[i]. If i = 0, then K is not present in the table, and the algorithm terminates. Else, set LINK[PRED] ← 0.
CD3. [Process remainder of chain.] (Variable i will walk through the successors of K in the chain.) Set HOLE ← i, i ← LINK[i], LINK[HOLE] ← 0. Do step CD4 zero or more times until i = 0. Then go to step CD5.
CD4. [Rehash record in ith slot.] Set j ← hash(KEY[i]). If j = HOLE, we move up the record to plug the hole by setting KEY[HOLE] ← KEY[i] and HOLE ← i. Otherwise, we link the record to the end of its hash chain by doing the following: set j ← LINK[j] zero or more times until LINK[j] = 0; then set LINK[j] ← i. Set k ← LINK[i], LINK[i] ← 0, and i ← k. Repeat step CD4 unless i = 0.
CD5. [Mark slot HOLE empty.] Set EMPTY[HOLE] ← true. Place HOLE at the start of the available-space list by setting NEXT[HOLE] ← NEXT[0], PREV[HOLE] ← 0, PREV[NEXT[0]] ← HOLE, NEXT[0] ← HOLE. ∎
Algorithm CD has the important property that it preserves randomness for the special case of standard coalesced hashing (when M = M'), in that deleting a record is in some sense like never having inserted it. The "sense" is strong enough so that the formulas for the average search times are still valid after deletions are performed. Exactly what preserving randomness means is explained in detail in [14].
We can speed up the rehashing phase in the latter half of step CD4 by linking the record into the chain immediately after its hash address rather than at the end of the chain. When this modified deletion algorithm is called on a random standard coalesced hash table, the resulting table is better-than-random: the average search times after N random insertions and one deletion are sometimes better (and never worse) than they would be with N − 1 random insertions alone. Whether or not this remains true after more than one deletion is an open problem.
If this deletion algorithm is used when there is a cellar (i.e., β < 1), we can modify it so that whenever a hole appears in the cellar during the execution of Algorithm CD, the next noncellar record in the chain moves up to plug the hole. Unfortunately, even with this modification, the algorithm does not break up chains well enough to preserve randomness. It seems possible that search performance may remain very good anyway. Analytic and empirical study is needed to determine just "how far from random" the search times get after deletions are performed.
Two remarks should be made about implementing this modified deletion algorithm. First, in step CD5, the empty slot should be returned to the start of the available-space list when the slot is in the cellar; otherwise, it should be placed at the end. This has the effect of giving cellar slots higher priority on the available-space list. Second, if a cellar slot is freed by a deletion and then reallocated during a later insertion, it is possible for a chain to go in and out of the cellar more than once. Programmers should no longer assume that a chain's cellar slots immediately follow the start of the chain.

7. Implementations and Variations

Most important searching algorithms have several different implementations in order to handle a variety of applications; coalesced hashing is no exception. We have already discussed some modifications in the last section in connection with deletion algorithms. In particular, we needed to use a doubly linked available-space list so that the empty slots could be added and removed quickly. Thus, the cellar need not be contiguous. Another strategy to handle a noncontiguous cellar is to link all the table slots together initially and to replace "Decrease R" in step C5 of Algorithm C with "Set R ← LINK[R]." With either modification, Algorithm C can simulate the separate chaining method until the cellar empties; subsequent colliders can be stored in the address region as usual. Hence, coalesced hashing can have the benefit of dynamic allocation as well as total storage utilization.
Another common data structure is to store pointers to the fields, rather than the fields themselves, in the table slots. For example, if the records are large, we might want to store only the key and link values in each slot, along with a pointer to where the rest of the record is located. We expand upon this idea later in this section.
If we are willing to do extra work during insertion and if the records are not pointed to from outside the table, we can modify the insertion algorithm to prevent the chains from coalescing: When a record R1 collides during insertion with another record R2 that is not at the
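The Fig. 7 walkthrough can be reproduced in code. The sketch below is a hedged Python illustration of Algorithm CD (it assumes the key being deleted is present, and uses the simple variable R in place of the available-space list; all names are ours):

```python
M = 11                                   # standard coalesced hashing, M = M'
key, link = [None] * (M + 1), [0] * (M + 1)
H = {"A.L.": 11, "AUDREY": 1, "AL": 1, "TOOTIE": 10,
     "DONNA": 1, "MARK": 1, "JEFF": 9, "DAVE": 9}   # Fig. 7 hash addresses

def insert(k):                           # Algorithm C, late insertion
    i = H[k]
    if key[i] is None:
        key[i] = k
        return
    while link[i]:                       # walk to the end of the chain,
        i = link[i]
    r = M                                # then take the largest empty slot
    while key[r] is not None:
        r -= 1
    key[r] = k
    link[i] = r

def search(k):
    i = H[k]
    while i and key[i] is not None:
        if key[i] == k:
            return i
        i = link[i]
    return None

def delete(k):                           # Algorithm CD (assumes k is present)
    i = H[k]
    if key[i] != k:                      # CD2: find k, cut it from its chain
        while key[i] != k:
            pred, i = i, link[i]
        link[pred] = 0
    hole, i = i, link[i]                 # CD3: k's slot becomes the hole
    key[hole], link[hole] = None, 0
    while i:                             # CD4: rehash each successor
        j = H[key[i]]
        if j == hole:                    # move the record up to plug the hole
            key[hole], key[i] = key[i], None
            hole = i
        else:                            # else append it to its hash chain
            while link[j]:
                j = link[j]
            link[j] = i
        nxt = link[i]
        link[i] = 0
        i = nxt

for k in ["A.L.", "AUDREY", "AL", "TOOTIE", "DONNA", "MARK", "JEFF", "DAVE"]:
    insert(k)
delete("AL")
```

Running this leaves the table of Fig. 7(b): TOOTIE has moved up to slot 10, JEFF to slot 9, and slot 6 is the final hole.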
start of the chain, we store R1 at its hash address and relocate R2 to some other spot. (The LINK field of R2's predecessor must be updated.) The size of the records should not be very large or else the cost of rearrangement might get prohibitive. There is an alternate strategy that prevents coalescing and does not relocate records, but it requires an extra link field per slot and the searches are slightly longer. One link field is used to chain together all the records with the same hash address. The other link field contains for slot i a pointer to the start of the chain of records with hash address i. Much of the space for the link fields is wasted, and chains may start one link away from their hash address. Resources could be put to better use by using coalesced hashing.
This section is devoted to the more nonobvious implementations of coalesced hashing. First, we describe alternate insertion strategies and then conclude with three applications to external searching on secondary storage devices. A scheme that allows the coalesced hash table to share memory with other data structures can be found in [12]. A generalization of coalesced hashing that uses nonuniform hash functions is described in [13].

7.1 Early-Insertion and Varied-Insertion Coalesced Hashing

If we know a priori that a record is not already present in the table, then it is not necessary in Algorithm C to search to the end of the chain before the record is inserted: If the hash address location is empty, the record can be inserted there; otherwise, we can link the record into the chain immediately after its hash address by rerouting pointers. We call this the early-insertion method because the collider is linked "early" in the chain, rather than at the end. We will refer to the unmodified algorithm (Algorithm C in Sec. 2) as the late-insertion method.
Early-insertion can be used even if we do not have a priori knowledge about the record's presence, in which case the entire chain must be searched in order to verify that the record is not already stored in the table. We can implement this form of early-insertion by making the following two modifications to Algorithm C. First, we add the assignment "Set j ← i" at the end of step C2, so that j stores the hash address hash(K). The second modification replaces the last sentence of step C5 by "Otherwise, link the Rth cell into the chain immediately after the hash address j by setting LINK[R] ← LINK[j], LINK[j] ← R; then set i ← R."
Each chain of records formed using early-insertion contains the same records as the corresponding chain formed by late-insertion. Since the length of a random unsuccessful search depends only on the number of records in the chain between the hash address and the end of the chain, and since all the records are in the address region when there is no cellar, it must be true that the average number of probes per unsuccessful search is the same for the two methods if there is no cellar. However, the order of the records within each chain may be different for early-insertion than for late-insertion. When there is no cellar, the early-insertion algorithm causes the records to align themselves in the chains closer to their hash addresses, on the average, than would be the case with late-insertion, so the expected successful search times are better.
A typical case is illustrated in Fig. 8. The record DAVE collides with A.L. at slot 5. In Fig. 8(a), which uses late-insertion, DAVE is linked to the end of the chain containing A.L., whereas if we use early-insertion as in Fig. 8(b),

Fig. 8. Standard Coalesced Hashing, M = M' = 11, N = 8. (a) Late-insertion; (b) Early-insertion. (Address size = 11; the two tables hold the same records and differ only in their links.)

    slot   (a) late-insertion   (b) early-insertion
      1    AUDREY               AUDREY
      2    --                   --
      3    DONNA                DONNA
      4    JEFF                 JEFF
      5    A.L.                 A.L.
      6    --                   --
      7    --                   --
      8    DAVE                 DAVE
      9    MARK                 MARK
     10    TOOTIE               TOOTIE
     11    AL                   AL

    Keys:            A.L.  AUDREY  AL  TOOTIE  DONNA  MARK  JEFF  DAVE
    Hash addresses:    5     1      5    10      3     11     4     5
    Avg. probes per successful search: (a) 13/8 ≈ 1.63; (b) 12/8 = 1.5.
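The probe counts quoted in Fig. 8 can be checked by simulating both insertion rules; this Python sketch (names are ours, and the key list is taken as the insertion order) rebuilds the two tables and verifies the totals of 13 and 12 probes:

```python
KEYS = ["A.L.", "AUDREY", "AL", "TOOTIE", "DONNA", "MARK", "JEFF", "DAVE"]
H = dict(zip(KEYS, [5, 1, 5, 10, 3, 11, 4, 5]))   # Fig. 8 hash addresses
M = 11                                            # standard case: M = M'

def build(early):
    key, link = [None] * (M + 1), [0] * (M + 1)
    for k in KEYS:
        j = H[k]
        if key[j] is None:
            key[j] = k
            continue
        r = M                          # collider: largest-numbered empty slot
        while key[r] is not None:
            r -= 1
        key[r] = k
        if early:                      # splice in right after the hash address
            link[r], link[j] = link[j], r
        else:                          # late: append at the end of the chain
            while link[j]:
                j = link[j]
            link[j] = r
    return key, link

def probes(key, link, k):              # probes to find k, starting at H[k]
    i, n = H[k], 1
    while key[i] != k:
        i, n = link[i], n + 1
    return n

for early, expect in [(False, 13), (True, 12)]:   # 13/8 vs. 12/8 of Fig. 8
    key, link = build(early)
    assert sum(probes(key, link, k) for k in KEYS) == expect
```

Both tables place every record in the same slot (the "largest-numbered empty slot" rule is unchanged); only the link order differs, which is exactly why the unsuccessful search cost is identical while the successful cost drops.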
DAVE is linked into the chain at the point between A.L. and AL. The average successful search time in Fig. 8(b) is slightly better than in Fig. 8(a), because linking DAVE into the chain immediately after A.L. (rather than at the end of the chain) reduces the search time for DAVE from four probes to two and increases the search time for AL from two probes to three. The result is a net decrease of one probe.
One can show easily that this effect manifests itself only on chains of length greater than 3, so there is little improvement when the load factor α is small, since the chains are usually short. Recent theoretical results show that the average number of probes per successful search is 5 percent better with early-insertion than with late-insertion when there is no cellar and the table is full (i.e., α = 1), but is only 0.5 percent better when α = 0.5 [1, 5]. A possible disadvantage of early-insertion is that earlier colliders tend to be shoved to the rear by later ones, which may not be desirable in some practical situations when the records inserted first tend to be accessed more often than those inserted later. Nevertheless, early-insertion is an improvement over late-insertion when there is no cellar.
When there is a cellar, preliminary studies indicate that search performance is probably worse with early-insertion than with Algorithm C, because a chain's records that are in the cellar now come at the end of the chain, whereas with late-insertion they come immediately after the start. In the example in Fig. 9(b), the insertion of JEFF causes both cellar records AL and TOOTIE to move one link further from their hash addresses. That does not happen with late-insertion in Fig. 9(a).
We shall now introduce a new variant, called varied-insertion, that can be shown to be better than both the late-insertion and early-insertion methods when there is a cellar. When there is no cellar, varied-insertion is identical to early-insertion. In the varied-insertion method, the early-insertion strategy is used except when the cellar is full and the hash address of the inserted record is the start of a chain that has records in the cellar. In that case, the record is linked into the chain immediately after the last cellar slot in the chain.
Figure 9(c) shows a typical hash table constructed using varied-insertion. The cellar is already full when the record DAVE is inserted. The hash address of DAVE is 1, which is at the start of a chain that has records in the cellar. Therefore, early-insertion is not used, and DAVE is instead linked into the chain immediately after AL, which is the last record in the chain that is in the cellar. The average number of probes per search is better for varied-insertion than for both late-insertion and early-insertion.

Fig. 9. Coalesced Hashing, M' = 11, M = 9, N = 8. (a) Late-insertion; (b) Early-insertion; (c) Varied-insertion. (Address size = 9; slots 10 and 11 form the cellar. The three tables hold the same records and differ only in their links.)

    slot   (a) late   (b) early   (c) varied
      1    A.L.       A.L.        A.L.
      2    --         --          --
      3    AUDREY     AUDREY      AUDREY
      4    --         --          --
      5    --         --          --
      6    DAVE       DAVE        DAVE
      7    JEFF       JEFF        JEFF
      8    MARK       MARK        MARK
      9    DONNA      DONNA       DONNA
    (10)   TOOTIE     TOOTIE      TOOTIE
    (11)   AL         AL          AL

    Keys:            A.L.  AUDREY  AL  TOOTIE  DONNA  MARK  JEFF  DAVE
    Hash addresses:    1     3      1     1      3      1     8     1
    Avg. probes per unsuccessful search: (a) 18/9 = 2.0; (b) 24/9 ≈ 2.67; (c) 18/9 = 2.0.
    Avg. probes per successful search: (a) 21/8 ≈ 2.63; (b) 22/8 = 2.75; (c) 20/8 = 2.5.

The varied-insertion method incorporates the advantages of early-insertion, but without any of the drawbacks described three paragraphs earlier. The records of a chain that are in the cellar always come immediately after the start of the chain. The average number of probes per search for varied-insertion is always less than or equal to that for late-insertion and early-insertion. For unsuccessful searches, the expected number of probes for varied-insertion and late-insertion are identical.
Research is currently underway to determine the average search times for the varied-insertion method, as well as to find the values of the optimum address factor β_opt. We expect that the initialization β ← 0.86 will be preferred in most situations, as it is for late-insertion. The resulting search times for varied-insertion should be a slight improvement over late-insertion.
The idea of linking the inserted record into the chain immediately after its hash address has been incorporated into the first modification of Algorithm CD in the last
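The statistics quoted in Fig. 9 can likewise be reproduced by simulation. In the sketch below (Python; names are ours, the key list is taken as the insertion order, and the varied rule is applied after the collider has been placed), the varied method tests whether the cellar is full and whether the inserted key's chain already contains cellar slots:

```python
KEYS = ["A.L.", "AUDREY", "AL", "TOOTIE", "DONNA", "MARK", "JEFF", "DAVE"]
H = dict(zip(KEYS, [1, 3, 1, 1, 3, 1, 8, 1]))   # Fig. 9 hash addresses
M, MP = 9, 11                                   # address region 1..9, cellar 10..11

def chain(key, link, j):                        # slots of the chain through j
    while j:
        yield j
        j = link[j]

def build(method):                              # "late", "early", or "varied"
    key, link = [None] * (MP + 1), [0] * (MP + 1)
    for k in KEYS:
        j = H[k]
        if key[j] is None:
            key[j] = k
            continue
        r = MP                                  # largest empty slot, so the
        while key[r] is not None:               # cellar fills first
            r -= 1
        key[r] = k
        cellar = [s for s in chain(key, link, j) if s > M]
        full = all(key[s] is not None for s in range(M + 1, MP + 1))
        if method == "late":
            at = list(chain(key, link, j))[-1]  # end of the chain
        elif method == "varied" and full and cellar:
            at = cellar[-1]                     # after the last cellar slot
        else:
            at = j                              # early: after the hash address
        link[r], link[at] = link[at], r
    return key, link

def succ(key, link):                            # total probes, successful
    total = 0
    for k in KEYS:
        i, n = H[k], 1
        while key[i] != k:
            i, n = link[i], n + 1
        total += n
    return total

def unsucc(key, link):                          # total probes, unsuccessful,
    return sum(len(list(chain(key, link, s)))   # over the M starting slots
               for s in range(1, M + 1))

for method, s, u in [("late", 21, 18), ("early", 22, 24), ("varied", 20, 18)]:
    key, link = build(method)
    assert (succ(key, link), unsucc(key, link)) == (s, u)
```

The asserted totals are exactly the numerators of Fig. 9: 21/8, 22/8, and 20/8 probes per successful search, and 18/9, 24/9, and 18/9 per unsuccessful search.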
section. It is natural to ask whether the modified deletion algorithm would preserve randomness for the modified insertion algorithms presented in this section. The answer is no, but it is possible that the deletion algorithm could make the table better-than-random, as discussed at the end of the last section. Finding good deletion algorithms for early-insertion and varied-insertion as well as for late-insertion is a challenging problem.

7.2 Coalesced Hashing with Buckets

Hashing is used extensively in database applications and file systems, where the hash table is too large to fit entirely in main memory and must be stored on external devices, like disks and drums. The hash table is sectioned off into blocks (or pages), each block containing b records; transfers to and from main memory take place a block at a time. Searching time is dominated by the block transfer rate; now the object is to minimize the expected number of block accesses per search.
Operating systems with a virtual memory environment are designed to break up data structures into blocks automatically, even though it appears to the programmer that his data structures all reside in main memory. Linear probing (see Sec. 5) is often the best hashing scheme to use in this environment, because successive probes occur in contiguous locations and are apt to be in the same block. Thus, one or two block accesses are usually sufficient for lookup.
We can do better if we know beforehand where the block divisions occur. We treat each block as a large table slot or bucket that can store b records. Let M' be the total number of buckets. The following modification of Algorithm C appears in [7].
To process a record with key K, we search for it in the chain of buckets, starting at bucket hash(K). After an unsuccessful search, we insert the record into the last bucket in the chain if there is room, or else we store it in some nonfull bucket and link that bucket to the end of the chain. We can speed up this last part by maintaining a doubly linked circular list of nonfull buckets, with a "roving pointer" marking one of the buckets. Each time we need another nonfull bucket to store a collider, we insert the record into the bucket indicated by the roving pointer, and then we reset the roving pointer to the next bucket on the list. This helps distribute the records evenly, because different chains will use different buckets (at least until we make one loop through the available-bucket list). When the external device is a disk, block accesses are faster when they occur on the same cylinder, so we should keep a separate available-bucket list for each cylinder.
Record size varies from application to application, but for purposes of illustration, we use the following parameters: the block size B is 4000 bytes; the total record size R is 400 bytes, of which the key comprises 7 bytes. The bucket size b is approximately B/R = 10. When the size of the bucket is that small, searching in each bucket can be done sequentially; there is no need for the record size to be fixed, as long as each record is preceded by its length (in bytes).
Deletions can be done in one of several ways, analogous to the different methods discussed in the last section. In some cases, it is best merely to mark the record as "deleted," because there may be pointers to the record from somewhere outside the hash table, and reusing the space could cause problems. Besides, many large scale database systems undergo periodic reorganization during low-peak hours, in which the entire table (minus the deleted records) is reconstructed from scratch [15]. This method has not been analyzed analytically, but it seems to have great potential.

7.3 Hash Tables Within a Hash Table

When the record size R is small compared to the block size B, the resulting bucket size b ≈ B/R is relatively large. Sequential search through the blocks is now too slow. (The block transfer rate no longer dominates search times.) Other methods should be used to organize the records within blocks.
This is especially true with multiattribute indexing, in which we can look up records via one of several different keys. For example, a large university database may allow a student's record to be accessed by specifying either his name, social security number, student I.D., or bank account number. In this case, four hash tables are used. Instead of storing all the records in four different tables, we let the four tables share a single copy of the records. Each hash table entry consists of only the key value, the link field, and a pointer to the rest of the student record (which is stored in some other block). Lookup now requires one extra block access. Continuing our numerical example, the table record size reduces from R = 400 bytes to about R = 12 bytes, since the key occupies 7 bytes, and the two pointer fields presumably can be squeezed into the remaining 5 bytes. The bucket size b is now about B/R ≈ 333.
In such cases where b is rather large and searching within a bucket can get expensive, it pays to organize each bucket as a hash table. The hash function must be modified to return a binary number at least ⌈log M'⌉ + ⌈log b⌉ bits in length; the high-order bits of the hash address specify one of the M' buckets (or blocks), and the low-order bits specify one of the b record positions within that bucket. Coalesced hashing is a natural method to use because the bucket size (in this example, b = 333) imposes a definite constraint on the number of records that may be stored in a block, so it is reasonable to try to optimize the amount of space devoted to the address region versus the amount of space devoted to the cellar.

7.4 Dynamic Hashing

So far we have not addressed the problem of what to do when overflow occurs, that is, when we want to insert more records into a hash table that is already full. The common technique is to place the extra records into an auxiliary storage pool and link them to the main table. Search performance remains tolerable as long as the number of insertions after overflow does not get too large. (Guibas [4] analyzes this for the special case of standard coalesced
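The bucketed insertion rule of Sec. 7.2, try the last bucket of the chain and otherwise link in a nonfull bucket found by the roving pointer, can be sketched as follows (a simplified illustration: the doubly linked list of nonfull buckets is replaced by a circular scan, and all names are ours):

```python
class BucketTable:
    def __init__(self, m, b):
        self.b = b                              # records per bucket
        self.buckets = [[] for _ in range(m)]   # buckets 0 .. m-1
        self.link = [None] * m                  # bucket-to-bucket chains
        self.rover = 0                          # roving pointer

    def _nonfull(self):
        m = len(self.buckets)
        for _ in range(m):                      # resume scanning where the
            self.rover = (self.rover + 1) % m   # last allocation left off
            if len(self.buckets[self.rover]) < self.b:
                return self.rover
        raise OverflowError("all buckets are full")

    def insert(self, key, h):
        i = h(key)
        while self.link[i] is not None:         # last bucket in the chain
            i = self.link[i]
        if len(self.buckets[i]) < self.b:       # room there?
            self.buckets[i].append(key)
        else:                                   # else take a nonfull bucket
            j = self._nonfull()                 # and link it to the chain end
            self.buckets[j].append(key)
            self.link[i] = j

    def search(self, key, h):
        i = h(key)
        while i is not None:
            if key in self.buckets[i]:          # sequential search in a bucket
                return i
            i = self.link[i]
        return None


t = BucketTable(m=4, b=2)
h = lambda k: 0                                 # worst case: everything collides
for k in ["k1", "k2", "k3", "k4", "k5"]:
    t.insert(k, h)
```

With bucket capacity 2, the five colliding keys occupy a chain of three buckets, and every key remains reachable through the bucket links.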
hashing.) Later during the off-hours when the system is not heavily used, a larger table is allocated and the records are reinserted into the new table.
This strategy is not viable when database utilization is relatively constant with time. Several similar methods, known loosely as dynamic hashing, have been devised that allow the table size to grow and shrink dynamically with little overhead [3, 8, 9]. When the load factor gets too high or when buckets overflow, the hash table grows larger and certain buckets are split, thereby reducing the congestion. If the bucket size is rather large, for example, if we allow multiattribute accessing, then coalesced hashing can be used to organize the records within a block, as explained above, thus combining this technique with coalesced hashing in a truly dynamic way.

8. Conclusions

Coalesced hashing is a conceptually elegant and extremely fast method for information storage and retrieval. This paper has examined in detail several practical issues concerning the implementation of the method. The analysis and programming techniques presented here should allow the reader to determine whether coalesced hashing is the method of choice in any given situation, and if so, to implement an efficient version of the algorithm.
The most important issue addressed in this paper is the initialization of the address factor β. The intricate optimization process discussed in Sec. 4 and the Appendix can in principle be applied to any implementation of coalesced hashing. Fortunately, there is no need to undertake such a computational burden for each application, because the results presented in this paper apply to most reasonable implementations. The initialization β = 0.86 is recommended in most cases, because it gives near-optimum search performance for a wide range of load factors. The graph in Fig. 2 makes it possible to fine-tune the choice of β, in case some prior knowledge about the types and frequencies of the searches is available.
The comparisons in Sec. 5 show that the tuned coalesced hashing algorithm outperforms several popular hashing methods when the load factor is greater than 0.6. The differences are more pronounced for large records. The inner search loop in Algorithm C is very short and simple, which is important for practical implementations. Coalesced hashing has the advantage over other chaining methods that it uses only one link field per slot and can achieve full storage utilization. The method is especially suited for applications with a constrained amount of memory or with the requirement that the records cannot be relocated after they are inserted.
In applications where deletions are necessary, one of the strategies described in Sec. 6 should work well in practice. However, research remains to be done in several areas including the analysis of the current deletion algorithms and the design of new strategies that hopefully will preserve randomness. The variant methods in Sec. 7 also pose interesting theoretical and practical open problems. The search performance of varied-insertion coalesced hashing is slightly better than Algorithm C; research is currently underway to analyze its performance and to determine the optimum setting β_opt. One exciting aspect of coalesced hashing is that it is an extremely good technique which very likely can be made even more applicable when these open questions are solved.

Appendix

For purposes of average-case analysis, we assume that an unsuccessful search can begin at any of the M address region slots with equal probability. This includes the special case of insertion. Similarly, each record in the table has the same chance of being the object of any given successful search. In other words, all searches and insertions involve random keys. This is sometimes called the Bernoulli probability model.
The asymptotic formulas in this section apply to a random M'-slot coalesced hash table with address region size M = ⌈βM'⌉ and with N = ⌈αM'⌉ inserted records, where the load factor α and the address factor β are constants in the ranges 0 ≤ α ≤ 1 and 0 < β ≤ 1. Formal derivations are given in [10, 11, 13].

Number of Probes Per Search
The expected number of probes in unsuccessful and successful searches, respectively, as M' → ∞ is

    C'_N(M', M) ≈ e^{-α/β} + α/β,   if α ≤ λβ;

    C'_N(M', M) ≈ (1/β) [ (1/4)(e^{2(α/β - λ)} - 1)(3 - 2/β + 2λ)
                          - (1/2)(α/β - λ) ] + λ + e^{-λ},   if α ≥ λβ;   (A1)

    C_N(M', M) ≈ 1 + α/(2β),   if α ≤ λβ;

    C_N(M', M) ≈ 1 + (β/(8α))(e^{2(α/β - λ)} - 1 - 2(α/β - λ))(3 - 2/β + 2λ)
                   + (1/4)(α/β + λ),   if α ≥ λβ;

where λ is the unique nonnegative solution to the equation