SlideShare a Scribd company logo
1 of 16
Download to read offline
1. Introduction

Programming Techniques                              Ellis Horowitz            One of the primary uses today for computer technol-
and Data Structures                                         Editor        ogy is information storage and retrieval. Typical search-
                                                                          ing applications include dictionaries, telephone listings,
Implementations for                                                       medical databases, symbol tables for compilers, and
                                                                          storing a company's business records. Each package of
Coalesced Hashing                                                         information is stored in computer memory as a record.
                                                                          We assume there is a special field in each record, called
Jeffrey Scott Vitter                                                      the key, that uniquely identifies it. The job of a searching
Brown University                                                          algorithm is to take an input K and return the record (if
                                                                          any) that has K as its key.
                                                                              Hashing is a widely used searching technique because
    The coalesced hashing method is one of the faster                     no matter how many records are stored, the average
searching methods known today. This paper is a practical                  search times remain bounded. The common element of
study of coalesced hashing for use by those who intend                    all hashing algorithms is a predefined and quickly com-
to implement or further study the algorithm. Techniques                   puted hash function
are developed for tuning an important parameter that
relates the sizes of the address region and the cellar in                        hash: (all possible keys) --~ (1, 2 . . . . . M}
order to optimize the average running times of different
implementations. A value for the parameter is reported                    that assigns each record to a hash address in a uniform
that works well in most cases. Detailed graphs explain                    manner. (The problem of designing hash functions that
how the parameter can be tuned further to meet specific                   justify this assumption, even when the distribution of the
needs. The resulting tuned algorithm outperforms several                  keys is highly biased, is well-studied [7, 2].) Hashing
well-known methods including standard coalesced hash-                     methods differ from one another by how they resolve a
ing, separate (or direct) chaining, linear probing, and                   collision when the hash address of the record to be
double bashing. A variety of related methods are also                     inserted is already occupied.
analyzed including deletion algorithms, a new and im-                           This paper investigates the coalesced hashing algo-
proved insertion strategy called varied-insertion, and ap-                rithm, which was first published 22 years ago and is still
plications to external searching on secondary storage                     one of the faster known searching methods [16, 7]. The
devices.                                                                  total number of available storage locations is assumed to
                                                                          be fixed. It is also convenient to assume that these
    CR Categories and Subject Descriptors: D.2.8 [Soft-                   locations are contiguous in memory. For the purpose of
ware Engineering]: Metrics--performance measures; E.2                     notation, we shall number the hash table slots 1, 2 . . . . .
[Data]: Data Storage Representations--hash-table rep-                     M'. The first M slots, which serve as the range of the
resentations; F.2.2 [Analysis of Algorithms and Problem                   hash function, constitute the address region. The remain-
Complexity]: Nonnumerical Algorithms and Problems--                       ing M ' - - M slots are devoted solely to storing records
sorting and searching; H.2.2 [Database Management]:                       that collide when inserted; they are called the cellar.
Physical Design--access methods; H.3.3 [Information                       Once the cellar becomes full, subsequent colliders must
Storage and Retrieval]: Information Search and Re-                        be stored in empty slots in the address region and, thus,
trieval-search process                                                    may trigger more collisions with records inserted later.
    General Terms: Algorithms, Design, Performance,                             For this reason, the search performance of the coa-
Theory                                                                    lesced hashing algorithm is very sensitive to the relative
    Additional Key Words and Phrases: analysis of algo-                   sizes of the address region and cellar. In Sec. 4, we apply
rithms, coalesced hashing, hashing, data structures, data-                the analytic results derived in [10, I1, 13] in order to
bases, deletion, asymptotic analysis, average-case, opti-                 optimize the ratio of their sizes, fl = M/M', which we
mization, secondary storage, assembly language                            call the address factor. The optimizations are based on
                                                                          two performance measures: the number of probes per
    This research was supported in part by a National Science Foun-       search and the running time of assembly language ver-
dation fellowship and by National Science Foundation grants MCS-          sions. There is no unique best choice for fl--the optimum
77-23738 and MCS-81-05324.
    Author's Present Address: Jeffrey Scott Vitter, Department of         address factor depends on the type of search, the number
Computer Science, Box 1910, Brown University, Providence, RI              of inserted records, and the performance measure cho-
02912.                                                                     s e n - b u t we shall see that the compromise choice fl
    Permission to copy without fee all or part of this material is
granted provided that the copies are not made or distributed for direct    0.86 works well in many situations. The method can be
commercial advantage, the ACM copyright notice and the title of the        further turned to meet specific needs.
publication and its date appear, and notice is given that copying is by         Section 5 shows that this tuned method dominates
permission of the Association for Computing Machinery. To copy
otherwise, or to republish, requires a fee and/or specific permission.    several popular hashing algorithms including standard
© 1982 ACM 0001-0782/82/1200-0911 $00.75.                                 coalesced hashing (in which fl = 1), separate (or direct)

911                                                                       Communications                 December 1982
                                                                          of                             Volume 25
                                                                          the ACM                        Number 12
chaining, linear probing, and double hashing. The last                       of both coalesced hashing and separate chaining, because
three sections deal with variations and different imple-                     the cellar is large enough to store the three colliders.
mentations for coalesced hashing including deletion al-                          Figures l(b) and l(c) show how the two methods
gorithms, alternative insertion methods, and external                        differ. The cellar contains only one slot in the example
searching on secondary storage devices.                                      in Fig. l(b). When the key MARKcollides with DONNA at
    This paper is designed to provide a comprehensive                        slot 4, the cellar is already full. Separate chaining would
treatment of the many practical issues concerned with                        report overflow at this point. The coalesced hashing
the implementation of the coalesced hashing method.                          method, however, stores the key MARK in the largest-
Readers interested in the theoretical justification of the                   numbered empty space (which is location 10 in the
results in this paper can consult [10, 11, 13, 14, 1].                       address region). This causes a later collision when DAVE
                                                                             hashes to position 10, so DAVE is placed in slot 8 at the
                                                                             end of the chain containing DONNA and MARK. The
                                                                             method derives its name from this "coalescing" of rec-
2. The Coalesced Hashing Algorithm                                           ords with different hash addresses into single chains.
                                                                                 The average number of probes per search shows
     The algorithm works like this: Given a record with                      marked improvement in Fig. l(b), even though coalesc-
key K, the algorithm searches for it in the hash table,                      ing has occurred. Intuitively, the larger address region
starting at location hash(K) and following the links in                      spreads out the records more evenly and causes fewer
the chain. If the record is present in the table, then it is                 collisions, i.e., the hash function can be thought of as
found and the search is successful; otherwise, the end of                    "shooting" at a bigger target. The cellar is now too small
the chain is reached and the search is unsuccessful. For                     to store these fewer colliders, so it overflows. Fortunately,
simplicity, we assume that the record is inserted when-                      this overflow occurs late in the game, and the pileup
ever the search ends unsuccessfully, according to the                        phenomenon of coalescing is not significant enough to
following rule: If position hash(K) is empty, then the                       counteract the benefits of a larger address region. How-
record is stored at that location; else, it is placed in the                 ever, in the extreme case when M = M ' = 11 and there
largest-numbered empty slot in the table and is linked to                    is no cellar (which we call standard coalesced hashing),
the end of the chain. This has the effect of putting the                     coalescing begins too early and search time worsens (as
first M ' - - M colliders into the cellar.                                   typified by Figure l(c)). Determining the optimum ad-
     Coalesced hashing is a generalization of the well-                      dress factor fl = M/M' is a major focus of this paper.
known separate (or direct) chaining method. The sepa-                            The first order of business before we can start a
rate chaining method halts with overflow when there is                       detailed study of the coalesced hashing method is to
no more room in the cellar to store a collider. The                          formalize the algorithm and to define reasonable
example in Fig. 1(a) can be considered to be an example                      measures of search performance. Let us assume that each

Fig. 1. Coalesced hashing, M ' = 11, N = 8. T h e sizes of the address region are (a) M = 8, (b) M = 10, a n d (c) M = I I .
                (a)                                                      (b)                                                      (c)
    address size = 8                                        a d d r e s s s i z e = 10                               a d d r e s s size = 11
  1     JEFF                                              1             A.L.        :                            1
  2    AUDREY                                             2                                                      2
  3                                                       3            JEFF                                      3       AUDREY
  4    DONNA                                              4          D N A
                                                                       O N                      ~                4        MARK
  5      A.L.                                             5                                                      5         AL
  6                                                       6        A D E
                                                                      U RY                                       6
  7    TOOTIE                                             7                                                      7         DAVE                 ~_
  8                                                       8                                                                JEFF
 (9)         DAVE                                         9              AL                                               DONNA          i
(10)         MARK                                        i0           MARK            /                          i°       TOOTLE
(Ii)          AL                                        (11) TOOTLE                         <                    11         A.L.


        Keys:                           A.L.      AUDREY            AL       TOOTLE          DONNA          MARK           JEFF          DAVE
                               (a)        s             2           2          7               4             5               1             2
Hash Addresses:                (b)        1             6            9          1              4             4               3            10
                               (o)       11             a           5          3               10            4              10             9

average # probes per successful search:                             (a) 1 2 / 8 = 1.5. (b) l l / 8         = 1.375. (c) 1 4 / 8 = 1.75.

912                                                                           Communications                      D e c e m b e r 1982
                                                                              of                                  V o l u m e 25
                                                                              the A C M                           N u m b e r 12
of the M ' contiguous slots in the coalesced hash table           In this paper, we concern ourselves with measuring
has the following organization:                               the searching phase of Algorithm C and ignore for the
                                                              most part the insertion time in steps C5 and C6. (The
           E                                                  time for step C5 is not significant, because the total
           M                                                  number of times R is decremented over the course of all
           P          KEY   other fields   LINK               the insertions cannot be more than the number of in-
           T                                                  serted records; hence, the amortized expected number of
           Y                                                  decrements is at most 1. The decrementing operation can
                                                              also be done in parallel with steps C 1-C4.) Our primary
For each value of i between 1 and M', E M P T Y [i] is a      measure of search performance is the number of probes
one-bit field that denotes whether the ith slot is unused,    per search, which is the number of different table slots
KEY[i] stores the key (if any), and LINK[i] is either the     that are accessed while searching. In Algorithm C, this
index to the next spot in the chain or else the null value    quantity is equal to
0.
    The algorithms in this article are written in the             max{ 1, number of times step C3 is performed}
English-like style used by Knuth in order to make them        For example, in Fig. l(b), the unsuccessful searches for
readily understandable to all and to facilitate compari-      keys A.L. and TOOTIE (immediately prior to their inser-
sons with the algorithms contained in [7, 4, 12]. Block-      tions) each took one probe, while a successful search for
structured languages, like P L / I and Pascal, are good for   DAVE would take two probes.
expressing complicated program modules; however, they             The average performance of the algorithm is ob-
are not used here, because hashing algorithms are so          tained by assuming that all searches and insertions are
short that there is no reason to discriminate against those   random. The Appendix contains a discussion of the
who are not comfortable with such languages.                  probability model as well as the formulas for the ex-
Algorithm C (Coalesced hashing search and insertion).         pected number of probes in unsuccessful and successful
This algorithm searches an M'-slot hash table, looking        searches.
for a given key K. If the search is unsuccessful and the
table is not full, then K is inserted.
                                                              3. Assembly Language Implementation
    The size of the address region is M; the hash function
hash returns a value between 1 and M (inclusive). For
                                                                  Even though probe-counting gives us a good idea of
convenience, we make use of slot 0, which is always
                                                              search performance, other factors (such as the complexity
empty. The global variable R is used to find an empty
                                                              of the search loop and the overhead is computing the
space whenever a collision must be stored in the table.
                                                              hash address) also affect the running time when Algo-
Initially, the table is empty, and we have R = M ' + 1;
                                                              rithm C is programmed for a real computer. For com-
when an empty space is requested, R is decremented
                                                              pleteness, we optimize the running time of assembly
until one is found. We assume that the following initial-
                                                              language versions of coalesced hashing.
izations have been made before any searches or inser-
                                                                  We choose to program in assembly language rather
tions are performed: M ~ [tiM'], for some constant            than in some high-level language like Fortran, PL/I, or
0 < fl _< 1; EMPTY[i] ,,-- true, for all 0 _< i _< M'; and    Pascal, in order to achieve maximum possible efficiency.
R ~ - - M ' + 1.                                              Top efficiency is important in large-scale applications of
C1. [Hash.] Set i ~-- hash(K). (Now 1 _< i _< M.)             hashing, but it can also be achieved in smaller systems
C2. [Is there a chain?] If EMPTY[i], then go to step C6.      with little extra effort, because hashing algorithms are so
    (Otherwise, the ith slot is occupied, so we will look     short that implementing them (even in assembly lan-
    at the chain of records that starts there.)               guage) is easy. We use a hypothetical language based on
C3. [Compare.] I f K = KEY[i], the algorithm terminates       Knuth's Mix [6] because its features are similar to most
      successfully.                                           well-known machines and its inherent simplicity allows
C4. [Advance to next record.] If LINK[i] ~ O, then set        us to write programs in clear and concise form.
    i ~ LINK[i] and go back to step C3.                           Program C below is a Mix-like implementation of
C5. [Find empty slot.] (The search for K in the chain         Algorithm C. Liberties have been taken with the lan-
    was unsuccessful, so we will try to find an empty         guage for purposes of clarity; the actual Mxx code appears
    table slot to store K.) Decrease R one or more times      in [10]. The program is written in a five-column format:
    until EMPTY[R] becomes true. I f R = 0, then there        the first column gives the line numbers, the second
    are no more empty slots, and the algorithm termi-         column lists the instruction labels, the third column
    nates with overflow. Otherwise, append the Rth cell       contains the assembly language instructions, the fourth
    to the chain by setting LINK[i] ~-- R; then set i         column counts the number of times the instructions are
    R.                                                        executed, and the last column is for comments that
C6. [Insert new record.] Set EMPTY[i] <--false, KEY[i]        explain what the instructions do. The syntax of the
       K, LINK[i] ~-- O, and initialize the other fields in   commands should be clear to those familiar with assem-
    the record. •                                             bly language programming. The four memory registers

913                                                           Communications               December 1982
                                                              of                           Volume 25
                                                              the ACM                      N u m b e r 12
used in Program C are named rA, rX, rI, and rJ. The                                 field: empty slots are marked by a - 1 in the L I N K field
reference KEY(I) denotes the contents of the m e m o r y                            of that slot. Null links are denoted by a 0 in the L I N K
location whose address is the value of K E Y plus the                               field. The variable R and the key K are stored in memory
contents of rI. (This is KEY[i] in the notation of Algo-                            locations R and K. Registers rI and rA are used to store
rithm C.)                                                                           the values of i and K. Register rJ stores either the value
    Program C (Coalesced hashing search and insertion).                             of LINK[i] or R. The instruction labels SUCCESS and
This program follows the conventions of Algorithm C,                                O V E R F L O W are for exiting and are assumed to lie
except that the E M P T Y field is implicit in the L I N K                          somewhere outside this code.
                                                                          I
      01 S T A R T             LD         X, K                            1                     Step C1. Load rX with K.
      02                       ENT        A, 0                            1                     Enter 0 into rA.
      03                       DIV        =M=                             1                     rA ~ [K/M], rX ~-- K mod M.
      04                       ENT        I, X                            1                     Enter rX into rI.
      05                       INC        I, 1                            1                     Increment rI by 1.
      06                       LD         A, K                            1                     Load rA with K.
      07                       LD         J, L I N K ( I )                1                     Step C2. Load rJ with LINK[i].
      08                       JN         J, STEP6                        1                     J u m p to STEP6 if LINK[i] < O.
      09                       CMP        A, KEY(l)                      A                      Step C3. C o m p a r e K with KEY[i].
      10                       JE         SUCCESS                        A                      Exit (successessfully) if K = KE Y[i].
      11                       JZ         J, STEP5                    A - SI                    J u m p to STEP5 if LINK[i] = O.
      12 STEP4                 ENT        I, J                         C - 1                    Step C4. Enter rJ into rI.
      13                       CMP        A, KEY(I)                    C - 1                    Step C3. C o m p a r e K with KEY[i].
      14                       JE         SUCCESS                      C- 1                     Exit (successessfully) if K = KEY[i].
      15                       LD         J, L I N K ( I )          C - 1 - $2                  Load rJ with LINK[i].
      16                       JNZ        J, STEP4                  C - 1 - $2                  J u m p to STEP4 if LINK[i] ~ O.
       17    STEP5             LD         J, R                         A - S                    Step C5. Load rJ with R.
      18                       DEC        J, 1                           T                      Decrement R by 1.
      19                       LD         X, L I N K ( J )               T                      Load rX with LINK[R].
      20                       JNN        X, .-2                         T                      G o back two steps if LINK[R] >_ O.
      21                       JZ         J, O V E R F L O W           A - S                    Exit (with overflow) if R = 0.
      22                       ST         J, L I N K ( I )             A - S                    Store R in LINK[i]
      23                       ENT        I, J                                A -    S          Enter rJ into rI.
      24                       ST         J, R                                A - S             Update R in memory.
      25 STEP6                 ST         0, L I N K ( I )                    1- S              Step C6. Store 0 in LINK[i].
      26                       ST         A, KEY(I)                           1- S              Store K i~ KEY[i]. •

      The execution time is measured in MIX units of time,                              The fourth column of Program C expresses the num-
which we denote u. The n u m b e r of time units required                           ber of times each instruction is executed in terms of the
by an instruction is equal to the number of m e m o r y                             quantities
references (including the reference to the instruction
                                                                                    C = n u m b e r of probes per search.
itself). Hence, the LD, ST, and CMP instructions each
                                                                                    A = 1 if the initial probe found an occupied slot,
take two units of time, while ENT, INC, DEC, and the
                                                                                        0 otherwise.
j u m p instructions require only one time unit. The divi-
                                                                                    S = 1 if successful, 0 if unsuccessful.
sion operation used to compute the hash address is an
                                                                                    T = n u m b e r of slots probed while looking for an empty
exception to this rule; it takes 14u to execute.
                                                                                        space.
      The running time of a MIX program is the weighted
sum                                                                                 We further decompose S into S 1 + $2, where S 1 = 1 if
                                                                                    the search is successful on the first probe, and S1 = 0
                         # times             '~// # time units '~                   otherwise. By formula (1), the total running time of the
                     the i n s t r u c t i o n ~ required by ~      (1)             searching phase is
  each instruction     is executed /  t h e instruction]
  in the program
                                                                                                     (7C + 4A + 17 - 3S + 2 S l ) u         (2)
This is a somewhat simplistic model, since it does not                              and the insertion of a new record after an unsuccessful
make use of cache or buffered m e m o r y for fast access of                        search (when S = 0) takes an additional (SA + 4 T + 4)u.
frequently used data, and since it ignores any interven-                            The average running time is the expected value of (2),
tion by the operating system. But it places all hashing                             assuming that all insertions and searches are random.
algorithms on an equal footing and gives a good indi-                               The formula can be obtained by replacing the variables
cation of relative merit.                                                           in Eq. (2) with their expected values.

914                                                                                 Communications                  D e c e m b e r 1982
                                                                                    of                              V o l u m e 25
                                                                                    the ACM                         N u m b e r 12
4. Tuning fl to Obtain Optimum Performance                     4.2 MIX Running Times
                                                                   Optimizing the MIX execution times could be tricky,
    The purpose of the analysis in [10, 11, 13] is to show     in general, because the formulas might have local as well
how the average-case performance of the coalesced hash-        as global minima. Then when we set the derivatives
ing method varies as a function of the address factor fl       equal to 0 in order to find floPr, there might be several
= M / M ' and the load factor a = N/M'. In this section,       roots to the resulting equations. The crucial fact that lets
for eachfixed value of a, we make use of those results in      us apply the same optimization techniques we used above
order to "tune" our choice of fl and speed up the search       for the number of probes is that the formulas for the MIX
times. Our two measures of performance are the expected        running times are well-behaved, as shown in the Appen-
number of probes per search and the average running            dix. By that we mean that each formula is minimized at
time of assembly language versions. In the latter case,        a unique floPT, which occurs either at the endpoint a =
we study a MIX implementation in detail, and then show         Aft or at the unique point in the "a > Aft" region where
how to apply what we learn to other assembly languages.        the derivative w.r.t, fl is 0.
    Unfortunately, there is no single choice of fl that            The optimization procedure is the same as before.
yields best results: the optimum choice flOPWis a function     The expected values of formulas (A4) and (A5), which
of the load factor a and it is even different for unsuc-       give the MIX running times for unsuccessful and success-
cessful and successful searches. The section concludes         ful searches, are functions of the three variables a, fl, and
with practical tips on how to initialize ft. In particular,    A. We substitute Eq. (3) into the expected running times
we shall see that the choice fl = 0.86 works well in most      in order to express fl in terms of A. For several different
situations.                                                    load factors c~ and for each type of search, we find the
                                                               value of A that minimizes the formula, and then we
4.1 Number of Probes Per Search                                retranslate this value via Eq. (3) to get flOPW.Figure 2(b)
    For each fixed value of a, we want to find the values      graphs these optimum values flOPW as a function of a;
flOPT that minimize the expected number of search probes       spline interpolation was used to fill in the gaps. As in the
in unsuccessful and successful searches. Formulas (A1)         previous section, the formulas for the average unsuccess-
and (A2) in the Appendix express the average number            ful and successful search times yield different optimum
of probes per search as a function of three variables: the     address factors. For the successful search case, notice
load factor c~ = N/M', the address factor fl = M/M',           how closely flOPT agrees with the corresponding values
and a new variable A = L/M, where L is the expected            that minimize the expected number of probes.
number of inserted records needed to make the cellar
become full. The variables fl and A are related by the
formula
                                   1                           Fig. 2. The values //OPT that optimize search performance for the
                       e -~ + A = -                     (3)    following three measures: (a) the expected number of probes per
                                  B                            search, (b) the expected running time of Program C, and (c) the
                                                               expected assembly language running time for large keys.
Formulas (A1) and (A2) each have two cases, "a _<
Aft" and "a _> Aft," which have the following intuitive                 1.o ~                                                                      1.0
meanings: The condition a < Aft means that with high
probability not enough records have been inserted to fill
up the cellar, while the condition a > Aft means that
enough records have been inserted to make the cellar
almost surely full.
    The optimum address factor flOPW is always located                                                                    Successful
somewhere in the "a _> Aft" region, as shown in the
Appendix. The rest of the optimization procedure is a             ~.~ 0.9                                                                          0.9
straightforward application of differential calculus. First,
we substitute Eq. (3) into the "a _> Aft" cases of the
formulas for the expected number of probes per search
in order to express them in terms of only the two                 ._E
variables a and A. For each nonzero fixed value of a, the                                                              (b) ~
                                                                                            Uns....... ful           ~ k" ~
                                                                                                                     
formulas are convex w.r.t. A and have unique minima.
We minimize them by setting their derivatives equal to
0. Numerical analysis techniques are used to solve the                  0,8
resulting equations and to get the optimum values of A
for several different values of a. Then we reapply Eq. (3)
to express the optimum points in terms of ft. The results
are graphed in Fig. 2(a), using spline interpolation to fill                0   0.1   0.2   0.3   0,4    0,5    0.6     0.7    0.8     0,9   1.0

in the gaps.                                                                                        ].oad]:actor,a

915                                                            Communications                            December 1982
                                                               of                                        Volume 25
                                                               the ACM                                   Number 12
4.3 Applying the Results to Other Implementations                 One strategy is to pick fl = 0.782, which minimizes
     Our MIX analysis suggests two important principles        the expected number of probes per unsuccessful search
 to be used in finding/?OPT for a particular implementa-       as well as the average MIX unsuccessful search time when
 tion of coalesced hashing. First, the formulas for the        the table is full (i.e., load factor a = l), as indicated in
 expected number of times each instruction in the pro-         Fig. 2. This choice of/3 yields the best absolute bound
gram is executed (which are expressed for Program C in         on search performance, because when the table is full,
terms of C, A, S, S 1, $2, and T) may have the two cases,      search times are greatest and unsuccessful searches av-
"a --< )~/3" and "a _> )~/3," but probably not more.           erage slightly longer than successful ones. Regardless of
     Second, the same optimization process as above can        the load factor, the expected number of probes per search
be used to find /3OPT, because the formulas for the            would be at most 1.79, and the average MIX searching
running times should be well-behaved for the following         time would be bounded by 33.52u.
reason: The main difference between Program C and                  Another strategy is to pick some compromise address
another implementation is likely to be the relative time       factor that leads to good overall performance for a large
it takes to process each key. (The keys are assumed to be      range of load factors. A reasonable choice is/3 = 0.86;
very small in the MIX version.) Thus, the unsuccessful         then the unsuccessful searches are optimized (over all
search time for another implementation might be ap-            other values o f f l ) when the load factor is =0.68 (number
proximately                                                    of probes) and ,~0.56 (MIX), and the successful search
                                                               performance is optimized at load factors -~0.94 (number
        [(2x + 5)C + (2x + 2)A + ( - 2 x + 19)]u'        (4)
                                                               of probes) and -~0.95 (MIX).
where u' is the standard unit of time on the other                 Figures 3 through 6 graph the expected search per-
computer and x is how many times longer it takes to            formance of coalesced hashing as a function of a for
process a key (multiplied by u/u'). Successful search          both types of searches (unsuccessful and successful) and
times would be about                                           for both measures of performance (number of probes
                                                               and MiX running time). The C1 curve corresponds to
                [(2x + 5 ) C + 18 + 2 S 1 ] u '          (5)
                                                               standard coalesced hashing (i.e., fl = l); the Co.86 line is
Formulas (4) and (5) were calculated by increasing the         our compromise choice fl = 0.86; and the dashed line
execution times of the key-processing steps 9 and 13 in        CoPx represents the best possible search performance
Program C by a factor of x. (See formulas (A4) and (A5)        that could be achieved by tuning (in which fl is optimized
for the x = 1 case.) We ignore the extra time it takes to      for each load factor).
load the larger key and to compute the hash function,              Notice that the value/3 = 0.86 yields near-optimum
since that does not affect the optimization.                   search times once the table gets half-full, so this compro-
    The role of C in formula (4) is less prevalent than in     mise offers a viable strategy. Of course, if some prior
(A4) as x gets large: the ratio of the coefficients of C and   knowledge about the types and frequencies of the
A decreases from 7/4 in (A4) and approaches the limit          searches were available, we could tailor our choice of/3
2/2 = 1 in formula (4). Even in this extreme case,             to meet those specific needs.
however, computer calculations show that the formula
for the average running time is well-behaved. The values
of/3OPT that minimize formula (4) when x is large are          5. Comparisons
graphed in Fig. 2(c).
    For successful searches, however, the value of C more          In this section, we compare the searching times of the
strongly dominates the running times for larger values of      coalesced hashing method with those from a represent-
x, so the limiting values offloPw in Fig. 2(c) coincide with   ative collection of hashing schemes: standard coalesced
the ones that minimize the expected number of probes           hashing (C1), separate chaining (S), separate chaining
per search in Fig. 2(a). Figure 2(b) shows that the            with ordered chains (SO), linear probing (L), and double
approximation is close even for the case x = l, which is       hashing (D). Implementations of the methods are given
Program C.                                                     in [10].
                                                                   These methods were chosen because they are the
4.4 How to Choose fl                                           most well-known and since they each have implemen-
    It is important to remember that the address region        tations similar to that of Algorithm C. Our comparisons
size M = [tiM'] must be initialized when the hash table        are based both on the expected number of probes per
is empty and cannot change thereafter. Unfortunately,          search as well as on the average MIX running time.
the last two sections show that each different load factor     Coalesced hashing performs better than the other
a requires a different optimum address factor /3OPT; in        methods. The differences are not so dramatic with the
fact, the values of flOPW differ for unsuccessful and suc-     MIX search times as with the number of probes per
cessful searches. This means that optimizing the average       search, due to the large overhead in computing the hash
unsuccessful (or successful) search time for a certain load    address. However, if the keys were larger and compari-
factor a will lead to suboptimum performance when the          sons took longer, the relative MIX savings would closely
load factor is not equal to a.                                 approximate the savings in number of probes.

916                                                            Communications               December 1982
                                                               of                           Volume 25
                                                               the ACM                      Number 12
Fig. 3. The average number of probes per unsuccessful search, as M                                                Fig. 4. The average number of probes per successful search, as M and
 and M' --~ ~, for coalesced hashing (C,, Co.86, COPT for fl = 1, 0.86,                                            M' ---> o0, for coalesced hashing (C,, C0.~6, COPT for fl = 1, 0.86, floP'r),
 flovr), separate chaining (S), separate chaining with ordered chains                                              separate chaining (S), separate chaining with ordered chains (SO),
 (SO), linear probing (L), and double hashing (D).                                                                 linear probing (L), and double hashing (D).
                                                                                                                              25


                                                                                                                                                                                  L /          l                  2.5




            2.0                                                                                       Y 2.0              ~   2.0




            1.5                                                            /         /                      1.5       ;.     L5
                                                                                                                                                                       /
                                                                                                /
                                                                           ~ s                  ~                                                                               S, SO

                                                                           ~         so

            I.O /             ~                                                                             1.0
               0      0.1     0,2     0.3       0.4 0.5 0.h            0.7      0.8        0.9        1.0                          0    0. I    02    0.3    0.4     0,5 06        0.7    08        0.9     1,0
                                               l.oadfactor, a or                                                                                            l.oadl,,ctor,a or

5.1 Standard Coalesced Hashing (C1)                                                                               "tuned" coalesced hashing are identical. Figures 3 and
    Standard coalesced hashing is the special case of                                                             4 show that the savings in number of probes per search
coalesced hashing for which fl = 1 and there is no cellar.                                                        can be as much as 14 percent (unsuccessful) and 6
This is obviously the most realistic comparison that can                                                          percent (successful). In Figs. 5 and 6, the corresponding
be made, because except for the initialization of the                                                             savings in MIX searching time is 6 percent (unsuccessful)
address region size, standard coalesced hashing and                                                               and 2 percent (successful).
Fig. 5. The average Mix execution time per unsuccessful search, as                                                Fig. 6. The average Mix execution time per successful search, as
M' ---> oo, for coalesced hashing (C,, C0.s6, CoPy for fl = 1, 0.86, flOPT),                                      M' --> ~, for coalesced hashing (C,, C0.s6, CoPy for fl = 1, 0.86, floPT),
separate chaining (S), separate chaining with ordered chains (SO),                                                separate chaining (S), separate chaining with ordered chains (SO),
linear probing (L), and double hashing (D).                                                                       linear probing (L), and double hashing (D).
            40
                                                                                                                             40                                                          L~D                    40
      {
                                                                                                Cl      35
                                                                                                                             35                                                                                 35




                                                                                                        30
                                                                                                                                          ~o ~         ~                                                        30
      .=                                                                                                                                                                                   ---

                                                                                                                     x
      ~. 25                                                                                            25
      ~     J



           20                                                                                           20                   20
                0   0,1     0.2     0,3      0.4     0.5       0.6   0.7       0,8        0.9        1.0                          0    0.I     0.2   0.3    0.4 0.5 0.6          0.7     0.8       0.9    1.0
                                            Load Factor,   a                                                                                                   I.oadFaca)r, a
917                                                                                                               Communications                                  December 1982
                                                                                                                  of                                              Volume 25
                                                                                                                  the A C M                                       Number 12
5.2 Separate (or Direct) Chaining (S)                         cessful search time of Program SO is worse than Program
     The separate chaining method is given an unfair          C's, and in real-life situations, the difference is likely to
 advantage in Figs. 3 and 4: the number of probes per         be more apparent, because records that are inserted first
 search is graphed as a function of ~ = N / M rather than     tend to be looked up more often and should be kept near
 a = N / M ' and does not take into account the number of     the beginning of the chain, not rearranged.
 auxiliary slots used to store colliders. In order to make        Method SO has the same storage limitations as the
the comparison fair, we must adjust the load factor           separate chaining scheme (i.e., the table usually over-
 accordingly.                                                 flows when N = M = 0.731M'), whereas coalesced
    Separate chaining implementations are designed of-        hashing can obtain full storage utilization.
ten to accommodate about N = M records; an average
of M(1 - 1 / M ) M ~ M / e auxiliary slots are needed to      5.4 Linear Probing (L) and Double Hashing (D)
store the colliders. The total table size is thus M ' = M          When searching for a record with key K, the linear
+ M/e. Solving backwards for M, we get M = 0.731M'.           probing method first checks location hash(K), and if
In other words, we may consider separate chaining to be       another record is already there, it steps cyclically through
the special case of coalesced hashing for which /3 -~         the table, starting at location hash(K), until the record is
0.731, except that no more records can be inserted once       found (successful search) or an empty slot is reached
the cellar overflows. Hence, the adjusted load factor is      (unsuccessful search). Insertions are done by placing the
a = 0.731~, and overflow occurs when there are around         record into the empty slot that terminated the unsuc-
N = M = 0.73 I M ' inserted records. (This is a reasonable    cessful search. Double hashing generalizes this by letting
space/time compromise: if we make M smaller, then             the cyclic step size be a function of K.
more records can usually be stored before overflow                 We have to adjust the load factor in the opposite
occurs, but the average search times blow up; if we           direction when we compare Algorithm C with methods
increase M to get better search times, then overflow          L and D, because the latter do not require L I N K fields.
occurs much sooner, and many slots are wasted.)               For example, if we suppose that the L I N K field com-
    If we adjust the load factors in Figs. 3 and 4 in this    prises ¼ of the total record size in a coalesced hashing
way, Algorithm C generates better search statistics: the      implementation, then the search statistics in Figs. 3 and
expected number of probes per search for separate chain-      4 for Algorithm C with load factor a should be compared
ing is -~ 1.37 (unsuccessful) and -~ 1.5 (successful) when    against those for linear probing and double hashing with
the load factor 6 is 1, while that for coalesced hashing is   load factor (¼)a. In this case, the average number of
   1.32 (unsuccessful) and -~ 1.44 (successful) when the      probes per search is still better for coalesced hashing.
load factor a =/3~ is equal to 0.731.                              However, the L I N K field is often much smaller than
    The graphs in Figs. 5 and 6 already reflect this load     the rest of the record, and sometimes it can be included
factor adjustment. In fact, the MIX implementation of         in the table at virtually no extra cost. The Mix imple-
separate chaining (Program S in [10]) is identical to         mentation Program C in [10] assumes that the raix field
Program C, except that /3 is initialized to 0.731 and         can be squeezed into the record without need of extra
overflow is signaled automatically when the cellar runs       storage Space. Figures 5 and 6, therefore, require no load
out of empty slots. Program C is slightly quicker in MIX      factor adjustment.
execution time than Program S, but more importantly,               To balance matters, the M~X implementations of lin-
the coalesced hashing implementation is more space            ear probing and double hashing, which are given in [10]
efficient: Program S usually overflows when a = 0.731,        and [7], contain two code optimizations. First, since
while Program C can always obtain full storage utiliza-       L I N K fields are not used in methods L and D, we no
tion a = 1. This confirms our intuition that coalesced        longer need 0 to denote a null L I N K , and we can
hashing can accomodate more records than the separate         renumber the table slots from 0 to M ' - 1; the hash
chaining method and still outperform separate chaining        function now returns a value between 0 and M ' - 1.
before that method overflows.                                 This makes the hash address computation faster by lu,
                                                              because the instruction INC I, 1 can be eliminated.
5.3 Separate Chaining with Ordered Chains (SO)                Second, the empty slots are denoted by the value 0 in
    This method is a variation of separate chaining in        order to make the comparisons in the inner loop as fast
which the chains are kept ordered by key value. The           as possible. This means that records are not allowed to
expected number of probes per successful search does          have a key value of 0. The final results are graphed in
not change, but unsuccessful searches are slightly            Figs. 5 and 6. Coalesced hashing clearly dominates when
quicker, because only about half the chain needs to be        the load factor is greater than 0.6.
searched, on the average.
    Our remarks about adjusting the load factor in Figs.
3 and 4 also apply to method SO. But even after that is       6. Deletions
done, the average number of probes per unsuccessful
search as well as the expected MIX unsuccessful search            It is often useful in hashing applications to be able to
time is slightly better for this method than for coalesced    delete records when they no longer logically belong to
hashing. However, as Fig. 6 illustrates, the average suc-     the set of objects being represented in the hash table. For

918                                                           Communications                December 1982
                                                              of                            Volume 25
                                                              the A C M                     N umbe r 12
example, in an airlines reservations system, passenger                         TOOTIE rehashes to the hole in location 10, so TOOTIE
records are often expunged soon after the flight has                           moves up to plug the hole, leaving a new hole in position
taken place.                                                                   9. Next, DONNA collides with AUDREY during rehashing,
     One possible deletion strategy often used for linear                      so DONNA remains in slot 8 and is linked to AUDREY.
probing and double hashing is to include a special one-                        Then MARK also collides with AUDREY; we leave MARK in
bit D E L E T E D field in each record that says whether or                    position 7 and link it to DONNA, which was formerly at
not the record has been deleted. The search algorithm                          the end of AUDREY'Shash chain. The record JEFF rehashes
must be modified to treat each "deleted" table slot as if                      to the hole in slot 9, so we move it up to plug the hole,
it were occupied by a null record, even though the entire                      and a new hole appears in position 6. Finally, DAVE
record is still there. This is especially desirable when                       rehashes to position 9 and joins JEVF'S chain.
there are pointers to the records from outside the table.                          Location 6 is the current hole position when the
     I f there are no such external pointers to worry about,                   deletion algorithm terminates, so we set EMPTY[6] ~--
the "deleted" table slots can be reused for later insertions:                  true and return it to the pool of empty slots. However,
Whenever an empty slot is needed in step C5 of Algo-                           the value of R in Algorithm C is already 5, so step C5
rithm C, the record is inserted into the first "deleted"                       will never try to reuse location 6 when an empty slot is
slot encountered during the unsuccessful search; if there                      needed.
is no such slot, an empty slot is allocated in the usual                           We can solve this problem by using an available-
way. However, a certain percentage of the "deleted" slots                      space list in step C5 rather than the variable R; the list
probably will remain unused, thus preventing full storage                      must be doubly linked so that a slot can be removed
utilization. Also, insertions and deletions over a pro-                        quickly from the list in step C6. The available-space list
longed period would cause the expected search times to                         does not require any extra space per table slot, since we
approximate those for a full table, regardless of the                          can use the K E Y and L I N K fields of the empty slots for
n u m b e r of undeleted records, because the "deleted"                        the two pointer fields. (The K E Y field is much larger
records make the searches longer.                                              than the L I N K field in typical implementations.) For
     I f we are willing to spend a little extra time per                       clarity, we rename the two pointer fields N E X T and
deletion, we can do without the D E L E T E D field by                         P R E V . Slot 0 in the table acts as the d u m m y start of the
relocating some of the records that follow in the chain.                       available-space list, so NEXT[O] points to the first actual
The basic idea is this: First, we find the record we want                      slot in the list and PREV[O] points to the last. Before
to delete, mark its table slot empty, and set the L I N K                      any records are inserted into the table, the following
field of its predecessor (if any) to the null value 0. Then                    extra initializations must be made: NEXT[O] <--- M '
we use Algorithm C to reinsert each record in the re-                          P R E V [ M ' ] ,--- 0; and N E X T [ i ] ~ i - 1 and P R E V [ i -
mainder of the chain, but whenever an empty slot is                            1] ~ i, for 1 _< i _< M'. We replace steps C5 and C6 by
needed in step C5, we use the position that the record                         C5. [Find empty slot.] (The search for K in the chain
already occupies.                                                                   was unsuccessful, so we will try to find an empty
     This method can be illustrated by deleting AL from                             table slot to store K.) I f the table is already full (i.e.,
location l0 in Fig. 7(a); the end result is pictured in Fig.                        NEXT[O] = 0), the algorithm terminates with over-
7(b). The first step is to create a hole in position l0 where                      flow. Otherwise, set L I N K [ i ] *---NEXT[O] and i *--
AL was, and to set AUDREY'S L I N K field to 0. Then we                             NEXT[0].
process the remainder of the chain. The next record                            C6. [Insert new record.] Remove the ith slot from the

Fig. 7. (a) Inserting the eight records; (b) Inserting all the records except AL.
                                             (a)                                                   (b)
                              1       AUDREY                                          1      AUDREY
                              2                                                      2
                              8                                                      3
                              4                                                      4                                   I
                              5      .. DAVE                                         5         DAVE
                              6         JEFF                                         6
                              7        MARK             -~                           7        MARK
                              8        DONNA            -I                           8       DONNA
                              9       ,TOOTlEAL         ""~J-1                       9        JEFF
                             i0                                                      I0      TOOTIE
                             Ii          A.L.                                        Ii        A.L.


             Keys:                     A.L.        AUDREY           AL       TOOTIE        DONNA          MARK        JEFF       DAVE
        Hash Addresses:                 11            1              1         10            1              1           9          9

919                                                                             Communications                   December 1982
                                                                                of                               Volume 25
                                                                                the ACM                          Number 12
available-space list by setting PREV'[NEXT[i]] ~--             resulting table is better-than-random: the average search
      PREV[i] and N E X T [ P R E V [ i ] ] ~-- NEXT[i]. Then        times after N random insertions and one deletion are
      set E M P T Y [ i ] ~-- false, KEY[i] ~ K, L I N K [ i ] ~--   sometimes better (and never worse) than they would be
      0, and initialize the other fields in the record.              with N - 1 random insertions alone. Whether or not this
                                                                     remains true after more than one deletion is an open
     The following deletion algorithm is analyzed in
                                                                     problem.
detail in [10] and [14].
                                                                          If this deletion algorithm is used when there is a
Algorithm CD (Deletion with coalesced hashing). This                 cellar (i.e., fl < 1), we can modify it so that whenever a
algorithm deletes the record with koy K from a coalesced             hole appears in the cellar during the execution of Algo-
hash table constructed by Algorithm C, with steps C5                 rithm CD, the next noncellar record in the chain moves
and C6 modified as above.                                            up to plug the hole. Unfortunately, even with this mod-
   This algorithm preserves the important invariant that             ification, the algorithm does not break up chains well
K is stored at its hash address if and only if it is at the          enough to preserve randomness. It seems possible that
start of its chain. This makes searching for K's predeces-           search performance may remain very good anyway.
sor in the chain easy: if it exists, then it must come at or         Analytic and empirical study is needed to determine just
after position hash(K) in the chain.                                 "how far from r a n d o m " the search times get after dele-
C D I . [Search for K.] Set i ~ hash(K). If E M P T Y [ i ] ,        tions are performed.
                                                                          Two remarks should be made about implementing
        then K is not present in the table and the algorithm
                                                                     this modified deletion algorithm. In step CD6, the empty
        terminates. Otherwise, if K = KEY[i], then K is at
                                                                     slot should be returned to the start of the available-space
        the start of the chain, so go to step CD3.
                                                                     list when the slot is in the cellar; otherwise, it should be
CD2. [Split chain in two.] (K is not at the start of its
                                                                     placed at the end. This has the effect of giving cellar slots
         chain.) Repeatedly set P R E D ~-- i and i *--
                                                                     higher priority on the available-space list. Second, if a
        L I N K [ i ] until either i = 0 or K = KEY[i]. I f i =
                                                                     cellar slot is freed by a deletion and then reallocated
        0, then K is not present in the table, and the
                                                                     during a later insertion, it is possible for chain to go in
        algorithm terminates. Else, set L I N K [ P R E D ]
                                                                     and out of the cellar more than once. Programmers
        0.
                                                                     should no longer assume that a chain's cellar slots im-
CD3. [Process remainder of chain.] (Variable i will walk
                                                                     mediately follow the start of the chain.
        through the successors of K in the chain.) Set
        H O L E ~ i, i ~ LINK[i], L I N K [ H O L E ] ~-- O.
        Do step CD4 zero or more times until i = 0. Then
                                                                     7. Implementations and Variations
        go to step CD5.
CD4. [Rehash record in ith slot.] Set j ~ hash(KEY[i]).
                                                                          Most important searching algorithms have several
        I f j = H O L E , we move up the record to plug the
                                                                     different implementations in order to handle a variety of
        hole by setting K E Y [ H O L E ] ~-- KEY[i] and
                                                                     applications; coalesced hashing is no exception. We have
        H O L E ~ i. Otherwise, we link the record to the
                                                                     already discussed some modifications in the last section
        end of its hash chain by doing the following: set
                                                                     in connection with deletion algorithms. In particular, we
        j .-- L I N K [ j ] zero or more times until L I N K [ j ]
                                                                     needed to use a doubly linked available-space list so that
        = 0; then set L I N K [ j ] *-- i. Set k *-- LINK[i],
                                                                     the empty slots could be added and removed quickly.
        LINK[i] ~ O, and i *-- k. Repeat step CD4 unless
                                                                     Thus, the cellar need not be contiguous. Another strategy
        i=0.
                                                                     to handle a noncontiguous cellar is to link all the table
CDS. [Mark slot H O L E empty.] Set E M P T Y [ H O L E ]
                                                                     slots together initially and to replace "Decrease R " in
             true. Place H O L E at the start of the available-
                                                                     step C5 of Algorithm C with "Set R *-- L I N K [ R ] . " With
        space list by setting N E X T [ H O L E ] ~ NEXT[O],
                                                                     either modification, Algorithm C can simulate the sepa-
        PRE V [ H O L E ] ~-- O, P R E V[NEXT[O]] ~ H O L E ,
                                                                     rate chaining method until the cellar empties; subsequent
        NEXT[O] ~ H O L E .                                     •
                                                                     colliders can be stored in the address region as usual.
    Algorithm CD has the important property that it                  Hence, coalesced hashing can have the benefit of dy-
preserves randomness for the special case of standard                namic allocation as well as total storage utilization.
coalesced hashing (when M = M ' ) , in that deleting a                    Another c o m m o n data structure is to store pointers
record is in some sense like never having inserted it. The           to the fields, rather than the fields themselves, in the
"sense" is strong enough so that the formulas for the                table slots. For example, if the records are large, we
average search times are still valid after deletions are             might want to store only the key and link values in each
performed. Exactly what preserving randomness means                  slot, along with a pointer to where the rest of the record
is explained in detail in [14].                                      is located. We expand upon this idea later in this section.
    We can speed up the rehashing phase in the latter                     If we are willing to do extra work during insertion
half of step CD4 by linking the record into the chain                and if the records are not pointed to from outside the
immediately after its hash address rather than at the end            table, we can modify the insertion algorithm to prevent
of the chain. When this modified deletion algorithm is               the chains from coalescing: W h e n a record R1 collides
called on a random standard coalesced hash table, the                during insertion with another record Rz that is not at the

920                                                                  Communications                December 1982
                                                                     of                            Volume 25
                                                                     the ACM                       Number 12
start of the chain, we store R, at its hash address and                   rithm (Algorithm C in Sec. 2) as the late-insertion
relocate R2 to some other spot. (The LINK field of R2's                   method.
predecessor must be updated.) The size of the records                         Early-insertion can be used even if we do not have a
should not be very large or else the cost of rearrangement                priori knowledge about the record's presence, in which
might get prohibitive. There is an alternate strategy that                case the entire chain must be searched in order to verify
prevents coalescing and does not relocate records, but it                 that the record is not already stored in the table. We can
requires an extra link field per slot and the searches are                implement this form of early-insertion by making the
slightly longer. One link field is used to chain together                 following two modifications to Algorithm C. First, we
all the records with the same hash address. The other                     add the assignment "Set j ~-- i" at the end of step C2, so
link field contains for slot i a pointer to the start of the              that j stores the hash address hash(K). The second
chain of records with hash address i. Much of the space                   modification replaces the last sentence of step C5 by
for the link fields is wasted, and chains m a y start one                 "Otherwise, link the R t h cell into the chain immediately
link away from their hash address. Resources could be                     after the hash addressj by setting LINK[R] ~--LINK[j],
put to better use by using coalesced hashing.                             LINK[j] ~ R; then set i ~ R."
    This section is devoted to the more nonobvious im-                        Each chain of records formed using early-insertion
plementations of coalesced hashing. First, we describe                    contains the same records as the corresponding chain
alternate insertion strategies and then conclude with                     formed by late-insertion. Since the length of a random
three applications to external searching on secondary                     unsuccessful search depends only on the number of
storage devices. A scheme that allows the coalesced hash                  records in the chain between the hash address and the
table to share m e m o r y with other data structures can be              end of the chain, and since all the records are in the
found in [ 12]. A generalization of coalesced hashing that                address region when there is no cellar, it must be true
uses nonuniform hash functions is described in [13].                      that the average n u m b e r of probes per unsuccessful
                                                                          search is the same for the two methods if there is no
7.1 Early-Insertion and Varied-lnsertion Coalesced                        cellar. However, the order of the records within each
Hashing                                                                   chain m a y be different for early-insertion than for late-
    I f we know a priori that a record is not already                     insertion. When there is no cellar, the early-insertion
present in the table, then it is not necessary in Algorithm               algorithm causes the records to align themselves in the
C to search to the end of the chain before the record is                  chains closer to their hash addresses, on the average,
inserted: I f the hash address location is empty, the record              than would be the case with late-insertion, so the ex-
can be inserted there; otherwise, we can link the record                  pected successful search times are better.
into the chain immediately after its hash address by                          A typical case is illustrated in Fig. 8. The record DAVE
rerouting pointers. We call this the early-insertion method               collides with A.L. at slot 5. In Fig. 8(a), which uses late-
because the collider is linked "early" in the chain, rather               insertion, DAVE is linked to the end of the chain contain-
than at the end. We will refer to the unmodified algo-                    ing A.L., whereas if we use early-insertion as in Fig. 8(b),

Fig. 8. Standard Coalesced Hashing, M = M ' = 11, N = 8. (a) Late-insertion; (b) Early-insertion.

                                           (a)                                                    (b)
                                  late-insertion                                       early-insertion
                               a d d r e s s s i z e = 11                            a d d r e s s s i z e = 11
                             1       AUDREY                                        1       AUDREY
                             2                                                    2
                             3       DONNA                                        S        DONNA
                             4        JEFF                                        4         JE~
                             5         A.L.                     ~,                5          A.L.
                            6                                                     6
                            7                                                     7
                            8        ..DAVE                                                 DAVE
                            9         MARK            ~                           9         MARK
                            10       TOOTIE                                       10       TOOTIE
                            ii          AL            /     .                     11         AL           -I"


                 Keys:                A.L.       AUDREY          AL      TOOTIE         DONNA          MARK          JEFF           DAVE
        Hash Addresses:                5            1             5        10             3             11             4              5

        ave. ]/probes         per succ. search:             ( a ) 1 3 / 8 ~ 1.63, ( b ) 1 2 / 6 = 1.5.

921                                                                        Communications                    D e c e m b e r 1982
                                                                           of                                Volume 25
                                                                           the A C M                         N u m b e r 12
DAVE is linked into the chain at the point between A.L.                      identical to early-insertion. In the varied-insertion
 and AL. The average successful search time in Fig. 8(b)                      method, the early-insertion strategy is used except when
 is slightly better than in Fig. 8(a), because linking DAVE                   the cellar is full and the hash address of the inserted
 into the chain immediately after A.L. (rather than at the                    record is the start of a chain that has records in the
 end of the chain) reduces the search time for DAVE from                      cellar. In that case, the record is linked into the chain
 four probes to two and increases the search time for AL                     immediately after the last cellar slot in the chain.
 from two probes to three. The result is a net decrease of                        Figure 9(c) shows a typical hash table constructed
 one probe.                                                                  using varied-insertion. The cellar is already full when
     One can show easily that this effect manifests itself                   the record DAVE is inserted. The hash address of DAVE is
 only on chains of length greater than 3, so there is little                  1, which is at the start of a chain that has records in the
 improvement when the load factor a is small, since the                      cellar. Therefore, early-insertion is not used, and DAVE
 chains are usually short. Recent theoretical results show                   is instead linked into the chain immediately after AL,
 that the average number of probes per successful search                     which is the last record in the chain that is in the cellar.
 is 5 percent better with early-insertion than with late-                    The average n u m b e r of probes per search is better for
 insertion when there is no cellar and the table is full (i.e.,              varied-insertion than for both late-insertion and early-
 a = 1), but is only 0.5 percent better when a = 0.5                         insertion.
 [1, 5]. A possible disadvantage of early-insertion is that                       The varied-insertion method incorporates the advan-
 earlier colliders tend to be shoved to the rear by later                    tages of early-insertion, but without any of the drawbacks
 ones, which m a y not be desirable in some practical                        described three paragraphs earlier. The records of a
 situations when the records inserted first tend to be                       chain that are in the cellar always come immediately
 accessed more often than those inserted later. Neverthe-                    after the start of the chain. The average n u m b e r of
 less, early-insertion is an improvement over late-insertion                 probes per search for varied-insertion is always less than
 when there is no cellar.                                                    or equal to that for late-insertion and early-insertion.
     When there is a cellar, preliminary studies indicate                    For unsuccessful searches, the expected n u m b e r of
 that search performance is probably worse with early-                       probes for varied-insertion and late-insertion are identi-
 insertion than with Algorithm C, because a chain's rec-                     cal.
 ords that are in the cellar now come at the end of the                           Research is currently underway to determine the
 chain, whereas with late-insertion they come immedi-                        average search times for the varied-insertion method, as
 ately after the start. In the example in Fig. 9(b), the                     well as to find the values of the o p t i m u m address factor
 insertion of JEFF causes both cellar records AL and TOOTIE                  flOVV. We expect that the initialization fi ~-- 0.86 will be
 to move one link further from their hash addresses. That                    preferred in most situations, as it is for late-insertion.
 does not happen with late-insertion in Fig. 9(b).                           The resulting search times for varied-insertion should be
     We shall now introduce a new variant, called varied-                    a slight improvement over late-insertion.
 insertion, that can be shown to be better than both the                          The idea of linking the inserted record into the chain
 late-insertion and early-insertion methods when there is                    immediately after its hash address has been incorporated
 a cellar. When there is no cellar, varied-insertion is                      into the first modification of Algorithm CD in the last
 Fig. 9. Coalesced Hashing, M ' = 11, M = 9, N = 8. (a) Late-insertion; (b) Early-insertion; and (c) Varied-insertion.
                (a)                                                   (b)                                                    (o)
         late-insertion                                       early-insertion                                       varie d-insertion
       a d d r e s s size = 9                                a d d r e s s size = 9                                 a d d r e s s size = 9
 I            A.L.                                      1           A.L.           •                           1           A.L.
 2                       ,        t
                                                        2                                                      2
 3         AUDREY        i :
                         i
                                  :
                                  I
                                                        3        AUDREY            --           ~,             3         AUDREY         •
 4                       ,        [                     4                                                      4
 5                                I
                                                        5                                                      5
 6           DAVE                 t
                                                        6          DAVE            "--                         6          DAVE          --~
 7           JEFF            -11                        7          JEFF            "--                         7          JEFF
 8           MARK            ~        ~-                8          MARK            "                           8          MARK          .I    <---i
                                  I

 9          DONNA                 l
                                      -- -J             9         DONNA                                        9         DONNA                i   •



(10)       TOOTIEAL                                    10)        TOOTLE           --.                       (10)        TOOTIE        --~
(11)                                                   I1)           AL                                      (ii)           AL         -..


        Keys:                    A.L.    AUDREY         AL       TOOTLE          DONNA          MARK         JEFF        DAVE
H a s h Addresses:                 1        3            1          1               3             1            8           1


ave. # probes         per unsuec, search:      ( a ) 1 8 / 9 = 2.0, ( b ) 2 4 / 9 ~ 2.67, ( c ) 1 8 / 9 = 2.0.
ave. # probes         per succ. search:   ( a ) 2 1 / 8 ~ 2 . 6 3 , ( b ) 2 2 / 8 = 2.75, ( c ) 2 0 / 8 = 2.5.

 922
section. It is natural to ask whether the modified deletion        Deletions can be done in one of several ways, anal-
algorithm would preserve randomness for the modified           ogous to the different methods discussed in the last
insertion algorithms presented in this section. The answer     section. In some cases, it is best merely to mark the
is no, but it is possible that the deletion algorithm could    record as "deleted," because there may be pointers to the
make the table better-than-random, as discussed at the         record from somewhere outside the hash table, and
end of the last section. Finding good deletion algorithms      reusing the space could cause problems. Besides, m a n y
for early-insertion and varied-insertion as well as for        large scale database systems undergo periodic reorgani-
late-insertion is a challenging problem.                       zation during low-peak hours, in which the entire table
                                                               (minus the deleted records) is reconstructed from scratch
7.2 Coalesced Hashing with Buckets                             [15]. This method has not been analyzed analytically,
    Hashing is used extensively in database applications       but it seems to have great potential.
and file systems, where the hash table is too large to fit
entirely in main memory and must be stored on external         7.3 Hash Tables Within a Hash Table
devices, like disks and drums. The hash table is sectioned         When the record size R is small compared to the
off into blocks (or pages), each block containing b rec-       block size B, the resulting bucket size b ~ B/R is
ords; transfers to and from main m e m o r y take place a      relatively large. Sequential search through the blocks is
block at a time. Searching time is dominated by the            now too slow. (The block transfer rate no longer domi-
block transfer rate; now the object is to minimize the         nates search times.) Other methods should be used to
expected number of block accesses per search.                  organize the records within blocks.
    Operating systems with a virtual memory environ-               This is especially true with multiattribute indexing, in
ment are designed to break up data structures into blocks      which we can look up records via one of several different
automatically, even though it appears to the programmer        keys. For example, a large university database may allow
that his data structures all reside in main memory. Linear     a student's record to be accessed by specifying either his
probing (see Sec. 5) is often the best hashing scheme to       name, social security number, student I.D., or bank
use in this environment, because successive probes occur       account number. In this case, four hash tables are used.
in contiguous locations and are apt to be in the same          Instead of storing all the records in four different tables,
block. Thus, one or two block accesses are usually suf-        we let the four tables share a single copy of the records.
ficient for lookup.                                            Each hash table entry consists of only the key value, the
    We can do better if we know beforehand where the           link field, and a pointer to the rest of the student record
block divisions occur. We treat each block as a large          (which is stored in some other block). Lookup now
table slot or bucket that can store b records. Let M ' be      requires one extra block access. Continuing our numer-
the total number of buckets. The following modification        ical example, the table record size reduces from R --- 400
of Algorithm C appears in [7].                                 bytes to about R = 12 bytes, since the key occupies
    To process a record with key K, we search for it in        7 bytes, and the two pointer fields presumably can be
the chain of buckets, starting at bucket hash(K). After        squeezed into the remaining 5 bytes. The bucket size b
an unsuccessful search, we insert the record into the last     is now about B / R ..~ 333.
bucket in the chain if there is room, or else we store it in       In such cases where b is rather large and searching
some nonfull bucket and link that bucket to the end of         within a bucket can get expensive, it pays to organize
the chain. We can speed up this last part by maintaining       each bucket as a hash table. The hash function must be
a doubly linked circular list of nonfull buckets, with a       modified to return a binary number at least [log M ' ] +
"roving pointer" marking one of the buckets. Each time         [log b] bits in length; the high-order bits of the hash
we need another nonfull bucket to store a collider, we         address specify one of the M ' buckets (or blocks), and
insert the record into the bucket indicated by the roving      the low-order bits specify one of the b record positions
pointer, and then we reset the roving pointer to the next      within that bucket. Coalesced hashing is a natural
bucket on the list. This helps distribute the records          method to use because the bucket size (in this example,
evenly, because different chains will use different buckets    b = 333) imposes a definite constraint on the number of
(at least until we make one loop through the available-        records that m a y be stored in a block, so it is reasonable
bucket list). When the external device is a disk, block        to try to optimize the amount of space devoted to the
accesses are faster when they occur on the same cylinder,      address region versus the amount of space devoted to the
so we should keep a separate available-bucket list for         cellar.
each cylinder.
    Record size varies from application to application,        7.4 Dynamic Hashing
but for purposes of illustration, we use the following              So far we have not addressed the problem of what to
parameters: the block size B is 4000 bytes; the total          do when overflow occurs--when we want to insert more
record size R is 400 bytes, of which the key comprises 7       records into a hash table that is already full. The c o m m o n
bytes. The bucket size b is approximately B/R = 10.            technique is to place the extra records into an auxiliary
When the size of the bucket is that small, searching in        storage pool and link them to the main table. Search
each bucket can be done sequentially; there is no need         performance remains tolerable as long as the number of
for the record size to be fixed, as long as each record is     insertions after overflow does not get too large. (Guibas
preceded by its length (in bytes).                             [4] analyzes this for the special case of standard coalesced

923
hashing.) Later during the off-hours when the system is       rithms and the design of new strategies that hopefully
 not heavily used, a larger table is allocated and the         will preserve randomness. The variant methods in Sec.
 records are reinserted into the new table.                    7 also pose interesting theoretical and practical open prob-
    This strategy is not viable when database utilization      lems. The search performance of varied-insertion coa-
is relatively constant with time. Several similar methods,     lesced hashing is slightly better than Algorithm C; re-
known loosely as dynamic hashing, have been devised            search is currently underway to analyze its performance
that allow the table size to grow and shrink dynamically       and to determine the optimum setting flopt. One excit-
with little overhead [3, 8, 9]. When the load factor gets      ing aspect of coalesced hashing is that it is an extreme-
too high or when buckets overflow, the hash table grows        ly good technique which very likely can be made even
larger and certain buckets are split, thereby reducing the     more applicable when these open questions are solved.
congestion. If the bucket size is rather large, for example,
if we allow multiattribute accessing, then coalesced hash-
ing can be used to organize the records within a block,        Appendix
as explained above, thus combining this technique with
coalesced hashing in a truly dynamic way.                          For purposes of average-case analysis, we assume
                                                               that an unsuccessful search can begin at any of the M
                                                               address region slots with equal probability. This includes
8. Conclusions                                                 the special case of insertion. Similarly, each record in the
                                                               table has the same chance of being the object of any
    Coalesced hashing is a conceptually elegant and ex-        given successful search. In other words, all searches and
tremely fast method for information storage and re-            insertions involve random keys. This is sometimes called
trieval. This paper has examined in detail several prac-       the Bernoulli probability model
tical issues concerning the implementation of the                  The asymptotic formulas in this section apply to a
method. The analysis and programming techniques pre-           random M'-slot coalesced hash table with address region
sented here should allow the reader to determine whether       size M = [tiM'] and with N -- raM'] inserted records,
coalesced hashing is the method of choice in any given         where the load factor a and the address factor fl are
situation, and if so, to implement an efficient version of     constants in the ranges 0 _< a <- l and 0 < fl _ I. Formal
the algorithm.                                                 derivations are given in [10, I l, 13].
    The most important issue addressed in this paper is
the initialization of the address factor ft. The intricate     Number of Probes Per Search
optimization process discussed in Sec. 4 and the Appen-           The expected number of probes in unsuccessful and
dix can in principle be applied to any implementation of       successful searches, respectively, as M ' ~ oo is
coalesced hashing. Fortunately, there is no need to un-
dertake such a computational burden for each applica-                                  + e -~/B             if a <-- Xfl
tion, because the results presented in this paper apply to
most reasonable implementations. The initialization fl                             1      1                        2
                                                               C'~(M', M) -        ~ + g (~(o/B-~) _ l) 3 - ~ + 2X
   0.86 is recommended in most cases, because it gives
near-optimum search performance for a wide range of
load factors. The graph in Fig. 2 makes it possible to
fine-tune the choice of fl, in case some prior knowledge
about the types and frequencies of the searches is avail-
able.
                                                                                f°I +--
                                                                                       la
                                                                                       2B
                                                                                          l(fl
                                                                                          2
                                                                                                     •)
                                                                                                            ( )
                                                                                                             ifa>__)~fi (AI)

                                                                                                             ifa--<~,fl
    The comparisons in Sec. 5 show that the tuned
coalesced hashing algorithm outperforms several popular                               IB
hashing methods when the load factor is greater then 0.6.                         I+--.
                                                                                      8a
The differences are more pronounced for large records.
The inner search loop in Algorithm C is very short and
simple, which is important for practical implementations.
                                                               CN(M', M)
Coalesced hashing has the advantage over other chaining
methods that it uses only one link field per slot and can                           (3
achieve full storage utilization. The method is especially
suited for applications with a constrained amount of
memory or with the requirement that the records cannot
                                                                                  +~          +X   ),
                                                                                                    +~X
be relocated after they are inserted.
    In applications where deletions are necessary, one of
the strategies described in Sec. 6 should work well in
practice. However, research remains to be done in several      where X is the umque nonnegative solution to the equa-
areas including the analysis of the current deletion algo-     tion

924                                                            Communications                    December 1982
                                                               of                                Volume 25
                                                               the ACM                           Number 12
Implementação do Hash Coalha/Coalesced
Implementação do Hash Coalha/Coalesced

More Related Content

What's hot

Coclustering Base Classification For Out Of Domain Documents
Coclustering Base Classification For Out Of Domain DocumentsCoclustering Base Classification For Out Of Domain Documents
Coclustering Base Classification For Out Of Domain Documentslau
 
Assignment on different types of addressing modes
Assignment on different types of addressing modesAssignment on different types of addressing modes
Assignment on different types of addressing modesNusratJahan263
 
SIMILARITY SEARCH FOR TRAJECTORIES OF RFID TAGS IN SUPPLY CHAIN TRAFFIC
SIMILARITY SEARCH FOR TRAJECTORIES OF RFID TAGS IN SUPPLY CHAIN TRAFFICSIMILARITY SEARCH FOR TRAJECTORIES OF RFID TAGS IN SUPPLY CHAIN TRAFFIC
SIMILARITY SEARCH FOR TRAJECTORIES OF RFID TAGS IN SUPPLY CHAIN TRAFFICijdms
 
1327 Addressing Modes Of 8086
1327 Addressing Modes Of 80861327 Addressing Modes Of 8086
1327 Addressing Modes Of 8086techbed
 
IRJET- Hadoop based Frequent Closed Item-Sets for Association Rules form ...
IRJET-  	  Hadoop based Frequent Closed Item-Sets for Association Rules form ...IRJET-  	  Hadoop based Frequent Closed Item-Sets for Association Rules form ...
IRJET- Hadoop based Frequent Closed Item-Sets for Association Rules form ...IRJET Journal
 
2005 fall cs523_lecture_4
2005 fall cs523_lecture_42005 fall cs523_lecture_4
2005 fall cs523_lecture_4abhineetverma
 
Case study how pointer plays very important role in data structure
Case study how pointer plays very important role in data structureCase study how pointer plays very important role in data structure
Case study how pointer plays very important role in data structureHoneyChintal
 
Alias Calculus for a Simple Imperative Language with Decidable Pointer Arithm...
Alias Calculus for a Simple Imperative Language with Decidable Pointer Arithm...Alias Calculus for a Simple Imperative Language with Decidable Pointer Arithm...
Alias Calculus for a Simple Imperative Language with Decidable Pointer Arithm...Iosif Itkin
 
An unsupervised framework for effective indexing of BigData
An unsupervised framework for effective indexing of BigDataAn unsupervised framework for effective indexing of BigData
An unsupervised framework for effective indexing of BigDataRamakrishna Prasad Sakhamuri
 
Implementation of Polynomial – ONB I Basis Conversion - Jurnal Ilmiah Teknik ...
Implementation of Polynomial – ONB I Basis Conversion - Jurnal Ilmiah Teknik ...Implementation of Polynomial – ONB I Basis Conversion - Jurnal Ilmiah Teknik ...
Implementation of Polynomial – ONB I Basis Conversion - Jurnal Ilmiah Teknik ...Marisa Paryasto
 
Financial Networks IV. Analyzing and Visualizing Exposures
Financial Networks IV. Analyzing and Visualizing ExposuresFinancial Networks IV. Analyzing and Visualizing Exposures
Financial Networks IV. Analyzing and Visualizing ExposuresKimmo Soramaki
 
Addressing modes of 8086
Addressing modes of 8086Addressing modes of 8086
Addressing modes of 8086saurav kumar
 
Iaetsd a novel vlsi dht algorithm for a highly modular and parallel
Iaetsd a novel vlsi dht algorithm for a highly modular and parallelIaetsd a novel vlsi dht algorithm for a highly modular and parallel
Iaetsd a novel vlsi dht algorithm for a highly modular and parallelIaetsd Iaetsd
 

What's hot (17)

Coclustering Base Classification For Out Of Domain Documents
Coclustering Base Classification For Out Of Domain DocumentsCoclustering Base Classification For Out Of Domain Documents
Coclustering Base Classification For Out Of Domain Documents
 
Assignment on different types of addressing modes
Assignment on different types of addressing modesAssignment on different types of addressing modes
Assignment on different types of addressing modes
 
SIMILARITY SEARCH FOR TRAJECTORIES OF RFID TAGS IN SUPPLY CHAIN TRAFFIC
SIMILARITY SEARCH FOR TRAJECTORIES OF RFID TAGS IN SUPPLY CHAIN TRAFFICSIMILARITY SEARCH FOR TRAJECTORIES OF RFID TAGS IN SUPPLY CHAIN TRAFFIC
SIMILARITY SEARCH FOR TRAJECTORIES OF RFID TAGS IN SUPPLY CHAIN TRAFFIC
 
Profile of NPOESS HDF5 Files
Profile of NPOESS HDF5 FilesProfile of NPOESS HDF5 Files
Profile of NPOESS HDF5 Files
 
1327 Addressing Modes Of 8086
1327 Addressing Modes Of 80861327 Addressing Modes Of 8086
1327 Addressing Modes Of 8086
 
IRJET- Hadoop based Frequent Closed Item-Sets for Association Rules form ...
IRJET-  	  Hadoop based Frequent Closed Item-Sets for Association Rules form ...IRJET-  	  Hadoop based Frequent Closed Item-Sets for Association Rules form ...
IRJET- Hadoop based Frequent Closed Item-Sets for Association Rules form ...
 
2005 fall cs523_lecture_4
2005 fall cs523_lecture_42005 fall cs523_lecture_4
2005 fall cs523_lecture_4
 
Ch13
Ch13Ch13
Ch13
 
Case study how pointer plays very important role in data structure
Case study how pointer plays very important role in data structureCase study how pointer plays very important role in data structure
Case study how pointer plays very important role in data structure
 
Alias Calculus for a Simple Imperative Language with Decidable Pointer Arithm...
Alias Calculus for a Simple Imperative Language with Decidable Pointer Arithm...Alias Calculus for a Simple Imperative Language with Decidable Pointer Arithm...
Alias Calculus for a Simple Imperative Language with Decidable Pointer Arithm...
 
An unsupervised framework for effective indexing of BigData
An unsupervised framework for effective indexing of BigDataAn unsupervised framework for effective indexing of BigData
An unsupervised framework for effective indexing of BigData
 
Implementation of Polynomial – ONB I Basis Conversion - Jurnal Ilmiah Teknik ...
Implementation of Polynomial – ONB I Basis Conversion - Jurnal Ilmiah Teknik ...Implementation of Polynomial – ONB I Basis Conversion - Jurnal Ilmiah Teknik ...
Implementation of Polynomial – ONB I Basis Conversion - Jurnal Ilmiah Teknik ...
 
Financial Networks IV. Analyzing and Visualizing Exposures
Financial Networks IV. Analyzing and Visualizing ExposuresFinancial Networks IV. Analyzing and Visualizing Exposures
Financial Networks IV. Analyzing and Visualizing Exposures
 
ch12
ch12ch12
ch12
 
Addressing modes of 8086
Addressing modes of 8086Addressing modes of 8086
Addressing modes of 8086
 
8086 addressing modes
8086 addressing modes8086 addressing modes
8086 addressing modes
 
Iaetsd a novel vlsi dht algorithm for a highly modular and parallel
Iaetsd a novel vlsi dht algorithm for a highly modular and parallelIaetsd a novel vlsi dht algorithm for a highly modular and parallel
Iaetsd a novel vlsi dht algorithm for a highly modular and parallel
 

Viewers also liked

Viewers also liked (7)

Exercício sobre hashing
Exercício sobre hashingExercício sobre hashing
Exercício sobre hashing
 
Aquece Para a prova de EDA3
Aquece Para a prova de EDA3Aquece Para a prova de EDA3
Aquece Para a prova de EDA3
 
Coalesced hashing / Hash Coalescido
Coalesced hashing / Hash CoalescidoCoalesced hashing / Hash Coalescido
Coalesced hashing / Hash Coalescido
 
Projetos de algoritmos com implementações em pascal e c (nivio ziviani, 4ed)
Projetos de algoritmos com implementações em pascal e c (nivio ziviani, 4ed)Projetos de algoritmos com implementações em pascal e c (nivio ziviani, 4ed)
Projetos de algoritmos com implementações em pascal e c (nivio ziviani, 4ed)
 
Introdução a estrutura de dados
Introdução a estrutura de dadosIntrodução a estrutura de dados
Introdução a estrutura de dados
 
Listas em C
Listas em CListas em C
Listas em C
 
Pilhas e Filas
Pilhas e FilasPilhas e Filas
Pilhas e Filas
 

Similar to Implementação do Hash Coalha/Coalesced

IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET Journal
 
Fredrick Ishengoma - HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma -  HDFS+- Erasure Coding Based Hadoop Distributed File SystemFredrick Ishengoma -  HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma - HDFS+- Erasure Coding Based Hadoop Distributed File SystemFredrick Ishengoma
 
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...RSIS International
 
DEVELOPING A NOVEL MULTIDIMENSIONAL MULTIGRANULARITY DATA MINING APPROACH FOR...
DEVELOPING A NOVEL MULTIDIMENSIONAL MULTIGRANULARITY DATA MINING APPROACH FOR...DEVELOPING A NOVEL MULTIDIMENSIONAL MULTIGRANULARITY DATA MINING APPROACH FOR...
DEVELOPING A NOVEL MULTIDIMENSIONAL MULTIGRANULARITY DATA MINING APPROACH FOR...cscpconf
 
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATIONMAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATIONijdms
 
IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...
IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...
IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...IRJET Journal
 
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemShuai Yuan
 
An efficient data mining framework on hadoop using java persistence api
An efficient data mining framework on hadoop using java persistence apiAn efficient data mining framework on hadoop using java persistence api
An efficient data mining framework on hadoop using java persistence apiJoão Gabriel Lima
 
SENSOR SIGNAL PROCESSING USING HIGH-LEVEL SYNTHESIS AND INTERNET OF THINGS WI...
SENSOR SIGNAL PROCESSING USING HIGH-LEVEL SYNTHESIS AND INTERNET OF THINGS WI...SENSOR SIGNAL PROCESSING USING HIGH-LEVEL SYNTHESIS AND INTERNET OF THINGS WI...
SENSOR SIGNAL PROCESSING USING HIGH-LEVEL SYNTHESIS AND INTERNET OF THINGS WI...pijans
 
SENSOR SIGNAL PROCESSING USING HIGH-LEVEL SYNTHESIS AND INTERNET OF THINGS WI...
SENSOR SIGNAL PROCESSING USING HIGH-LEVEL SYNTHESIS AND INTERNET OF THINGS WI...SENSOR SIGNAL PROCESSING USING HIGH-LEVEL SYNTHESIS AND INTERNET OF THINGS WI...
SENSOR SIGNAL PROCESSING USING HIGH-LEVEL SYNTHESIS AND INTERNET OF THINGS WI...pijans
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Improving performance of decision support queries in columnar cloud database ...
Improving performance of decision support queries in columnar cloud database ...Improving performance of decision support queries in columnar cloud database ...
Improving performance of decision support queries in columnar cloud database ...Serkan Özal
 
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...Derryck Lamptey, MPhil, CISSP
 
Column store databases approaches and optimization techniques
Column store databases  approaches and optimization techniquesColumn store databases  approaches and optimization techniques
Column store databases approaches and optimization techniquesIJDKP
 
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkA Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkIRJET Journal
 
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCEPERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCEijdpsjournal
 
International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)ijdpsjournal
 
Transforming data-centric eXtensible markup language into relational database...
Transforming data-centric eXtensible markup language into relational database...Transforming data-centric eXtensible markup language into relational database...
Transforming data-centric eXtensible markup language into relational database...journalBEEI
 

Similar to Implementação do Hash Coalha/Coalesced (20)

IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
 
Fredrick Ishengoma - HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma -  HDFS+- Erasure Coding Based Hadoop Distributed File SystemFredrick Ishengoma -  HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma - HDFS+- Erasure Coding Based Hadoop Distributed File System
 
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
 
DEVELOPING A NOVEL MULTIDIMENSIONAL MULTIGRANULARITY DATA MINING APPROACH FOR...
DEVELOPING A NOVEL MULTIDIMENSIONAL MULTIGRANULARITY DATA MINING APPROACH FOR...DEVELOPING A NOVEL MULTIDIMENSIONAL MULTIGRANULARITY DATA MINING APPROACH FOR...
DEVELOPING A NOVEL MULTIDIMENSIONAL MULTIGRANULARITY DATA MINING APPROACH FOR...
 
asdfew.pptx
asdfew.pptxasdfew.pptx
asdfew.pptx
 
ADAPTER
ADAPTERADAPTER
ADAPTER
 
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATIONMAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
 
IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...
IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...
IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...
 
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
 
An efficient data mining framework on hadoop using java persistence api
An efficient data mining framework on hadoop using java persistence apiAn efficient data mining framework on hadoop using java persistence api
An efficient data mining framework on hadoop using java persistence api
 
SENSOR SIGNAL PROCESSING USING HIGH-LEVEL SYNTHESIS AND INTERNET OF THINGS WI...
SENSOR SIGNAL PROCESSING USING HIGH-LEVEL SYNTHESIS AND INTERNET OF THINGS WI...SENSOR SIGNAL PROCESSING USING HIGH-LEVEL SYNTHESIS AND INTERNET OF THINGS WI...
SENSOR SIGNAL PROCESSING USING HIGH-LEVEL SYNTHESIS AND INTERNET OF THINGS WI...
 
SENSOR SIGNAL PROCESSING USING HIGH-LEVEL SYNTHESIS AND INTERNET OF THINGS WI...
SENSOR SIGNAL PROCESSING USING HIGH-LEVEL SYNTHESIS AND INTERNET OF THINGS WI...SENSOR SIGNAL PROCESSING USING HIGH-LEVEL SYNTHESIS AND INTERNET OF THINGS WI...
SENSOR SIGNAL PROCESSING USING HIGH-LEVEL SYNTHESIS AND INTERNET OF THINGS WI...
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Improving performance of decision support queries in columnar cloud database ...
Improving performance of decision support queries in columnar cloud database ...Improving performance of decision support queries in columnar cloud database ...
Improving performance of decision support queries in columnar cloud database ...
 
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...
 
Column store databases approaches and optimization techniques
Column store databases  approaches and optimization techniquesColumn store databases  approaches and optimization techniques
Column store databases approaches and optimization techniques
 
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkA Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
 
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCEPERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
 
International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)
 
Transforming data-centric eXtensible markup language into relational database...
Transforming data-centric eXtensible markup language into relational database...Transforming data-centric eXtensible markup language into relational database...
Transforming data-centric eXtensible markup language into relational database...
 

Recently uploaded

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 

Recently uploaded (20)

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 

Implementação do Hash Coalha/Coalesced

  • 1. 1. Introduction Programming Techniques Ellis Horowitz One of the primary uses today for computer technol- and Data Structures Editor ogy is information storage and retrieval. Typical search- ing applications include dictionaries, telephone listings, Implementations for medical databases, symbol tables for compilers, and storing a company's business records. Each package of Coalesced Hashing information is stored in computer memory as a record. We assume there is a special field in each record, called Jeffrey Scott Vitter the key, that uniquely identifies it. The job of a searching Brown University algorithm is to take an input K and return the record (if any) that has K as its key. Hashing is a widely used searching technique because The coalesced hashing method is one of the faster no matter how many records are stored, the average searching methods known today. This paper is a practical search times remain bounded. The common element of study of coalesced hashing for use by those who intend all hashing algorithms is a predefined and quickly com- to implement or further study the algorithm. Techniques puted hash function are developed for tuning an important parameter that relates the sizes of the address region and the cellar in hash: (all possible keys) --~ (1, 2 . . . . . M} order to optimize the average running times of different implementations. A value for the parameter is reported that assigns each record to a hash address in a uniform that works well in most cases. Detailed graphs explain manner. (The problem of designing hash functions that how the parameter can be tuned further to meet specific justify this assumption, even when the distribution of the needs. The resulting tuned algorithm outperforms several keys is highly biased, is well-studied [7, 2].) Hashing well-known methods including standard coalesced hash- methods differ from one another by how they resolve a ing, separate (or direct) chaining, linear probing, and collision when the hash address of the record to be double bashing. A variety of related methods are also inserted is already occupied. analyzed including deletion algorithms, a new and im- This paper investigates the coalesced hashing algo- proved insertion strategy called varied-insertion, and ap- rithm, which was first published 22 years ago and is still plications to external searching on secondary storage one of the faster known searching methods [16, 7]. The devices. total number of available storage locations is assumed to be fixed. It is also convenient to assume that these CR Categories and Subject Descriptors: D.2.8 [Soft- locations are contiguous in memory. For the purpose of ware Engineering]: Metrics--performance measures; E.2 notation, we shall number the hash table slots 1, 2 . . . . . [Data]: Data Storage Representations--hash-table rep- M'. The first M slots, which serve as the range of the resentations; F.2.2 [Analysis of Algorithms and Problem hash function, constitute the address region. The remain- Complexity]: Nonnumerical Algorithms and Problems-- ing M ' - - M slots are devoted solely to storing records sorting and searching; H.2.2 [Database Management]: that collide when inserted; they are called the cellar. Physical Design--access methods; H.3.3 [Information Once the cellar becomes full, subsequent colliders must Storage and Retrieval]: Information Search and Re- be stored in empty slots in the address region and, thus, trieval-search process may trigger more collisions with records inserted later. General Terms: Algorithms, Design, Performance, For this reason, the search performance of the coa- Theory lesced hashing algorithm is very sensitive to the relative Additional Key Words and Phrases: analysis of algo- sizes of the address region and cellar. In Sec. 4, we apply rithms, coalesced hashing, hashing, data structures, data- the analytic results derived in [10, I1, 13] in order to bases, deletion, asymptotic analysis, average-case, opti- optimize the ratio of their sizes, fl = M/M', which we mization, secondary storage, assembly language call the address factor. The optimizations are based on two performance measures: the number of probes per This research was supported in part by a National Science Foun- search and the running time of assembly language ver- dation fellowship and by National Science Foundation grants MCS- sions. There is no unique best choice for fl--the optimum 77-23738 and MCS-81-05324. Author's Present Address: Jeffrey Scott Vitter, Department of address factor depends on the type of search, the number Computer Science, Box 1910, Brown University, Providence, RI of inserted records, and the performance measure cho- 02912. s e n - b u t we shall see that the compromise choice fl Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct 0.86 works well in many situations. The method can be commercial advantage, the ACM copyright notice and the title of the further turned to meet specific needs. publication and its date appear, and notice is given that copying is by Section 5 shows that this tuned method dominates permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. several popular hashing algorithms including standard © 1982 ACM 0001-0782/82/1200-0911 $00.75. coalesced hashing (in which fl = 1), separate (or direct) 911 Communications December 1982 of Volume 25 the ACM Number 12
  • 2. chaining, linear probing, and double hashing. The last of both coalesced hashing and separate chaining, because three sections deal with variations and different imple- the cellar is large enough to store the three colliders. mentations for coalesced hashing including deletion al- Figures l(b) and l(c) show how the two methods gorithms, alternative insertion methods, and external differ. The cellar contains only one slot in the example searching on secondary storage devices. in Fig. l(b). When the key MARKcollides with DONNA at This paper is designed to provide a comprehensive slot 4, the cellar is already full. Separate chaining would treatment of the many practical issues concerned with report overflow at this point. The coalesced hashing the implementation of the coalesced hashing method. method, however, stores the key MARK in the largest- Readers interested in the theoretical justification of the numbered empty space (which is location 10 in the results in this paper can consult [10, 11, 13, 14, 1]. address region). This causes a later collision when DAVE hashes to position 10, so DAVE is placed in slot 8 at the end of the chain containing DONNA and MARK. The method derives its name from this "coalescing" of rec- 2. The Coalesced Hashing Algorithm ords with different hash addresses into single chains. The average number of probes per search shows The algorithm works like this: Given a record with marked improvement in Fig. l(b), even though coalesc- key K, the algorithm searches for it in the hash table, ing has occurred. Intuitively, the larger address region starting at location hash(K) and following the links in spreads out the records more evenly and causes fewer the chain. If the record is present in the table, then it is collisions, i.e., the hash function can be thought of as found and the search is successful; otherwise, the end of "shooting" at a bigger target. The cellar is now too small the chain is reached and the search is unsuccessful. For to store these fewer colliders, so it overflows. Fortunately, simplicity, we assume that the record is inserted when- this overflow occurs late in the game, and the pileup ever the search ends unsuccessfully, according to the phenomenon of coalescing is not significant enough to following rule: If position hash(K) is empty, then the counteract the benefits of a larger address region. How- record is stored at that location; else, it is placed in the ever, in the extreme case when M = M ' = 11 and there largest-numbered empty slot in the table and is linked to is no cellar (which we call standard coalesced hashing), the end of the chain. This has the effect of putting the coalescing begins too early and search time worsens (as first M ' - - M colliders into the cellar. typified by Figure l(c)). Determining the optimum ad- Coalesced hashing is a generalization of the well- dress factor fl = M/M' is a major focus of this paper. known separate (or direct) chaining method. The sepa- The first order of business before we can start a rate chaining method halts with overflow when there is detailed study of the coalesced hashing method is to no more room in the cellar to store a collider. The formalize the algorithm and to define reasonable example in Fig. 1(a) can be considered to be an example measures of search performance. Let us assume that each Fig. 1. Coalesced hashing, M ' = 11, N = 8. T h e sizes of the address region are (a) M = 8, (b) M = 10, a n d (c) M = I I . (a) (b) (c) address size = 8 a d d r e s s s i z e = 10 a d d r e s s size = 11 1 JEFF 1 A.L. : 1 2 AUDREY 2 2 3 3 JEFF 3 AUDREY 4 DONNA 4 D N A O N ~ 4 MARK 5 A.L. 5 5 AL 6 6 A D E U RY 6 7 TOOTIE 7 7 DAVE ~_ 8 8 JEFF (9) DAVE 9 AL DONNA i (10) MARK i0 MARK / i° TOOTLE (Ii) AL (11) TOOTLE < 11 A.L. Keys: A.L. AUDREY AL TOOTLE DONNA MARK JEFF DAVE (a) s 2 2 7 4 5 1 2 Hash Addresses: (b) 1 6 9 1 4 4 3 10 (o) 11 a 5 3 10 4 10 9 average # probes per successful search: (a) 1 2 / 8 = 1.5. (b) l l / 8 = 1.375. (c) 1 4 / 8 = 1.75. 912 Communications D e c e m b e r 1982 of V o l u m e 25 the A C M N u m b e r 12
  • 3. of the M ' contiguous slots in the coalesced hash table In this paper, we concern ourselves with measuring has the following organization: the searching phase of Algorithm C and ignore for the most part the insertion time in steps C5 and C6. (The E time for step C5 is not significant, because the total M number of times R is decremented over the course of all P KEY other fields LINK the insertions cannot be more than the number of in- T serted records; hence, the amortized expected number of Y decrements is at most 1. The decrementing operation can also be done in parallel with steps C 1-C4.) Our primary For each value of i between 1 and M', E M P T Y [i] is a measure of search performance is the number of probes one-bit field that denotes whether the ith slot is unused, per search, which is the number of different table slots KEY[i] stores the key (if any), and LINK[i] is either the that are accessed while searching. In Algorithm C, this index to the next spot in the chain or else the null value quantity is equal to 0. The algorithms in this article are written in the max{ 1, number of times step C3 is performed} English-like style used by Knuth in order to make them For example, in Fig. l(b), the unsuccessful searches for readily understandable to all and to facilitate compari- keys A.L. and TOOTIE (immediately prior to their inser- sons with the algorithms contained in [7, 4, 12]. Block- tions) each took one probe, while a successful search for structured languages, like P L / I and Pascal, are good for DAVE would take two probes. expressing complicated program modules; however, they The average performance of the algorithm is ob- are not used here, because hashing algorithms are so tained by assuming that all searches and insertions are short that there is no reason to discriminate against those random. The Appendix contains a discussion of the who are not comfortable with such languages. probability model as well as the formulas for the ex- Algorithm C (Coalesced hashing search and insertion). pected number of probes in unsuccessful and successful This algorithm searches an M'-slot hash table, looking searches. for a given key K. If the search is unsuccessful and the table is not full, then K is inserted. 3. Assembly Language Implementation The size of the address region is M; the hash function hash returns a value between 1 and M (inclusive). For Even though probe-counting gives us a good idea of convenience, we make use of slot 0, which is always search performance, other factors (such as the complexity empty. The global variable R is used to find an empty of the search loop and the overhead is computing the space whenever a collision must be stored in the table. hash address) also affect the running time when Algo- Initially, the table is empty, and we have R = M ' + 1; rithm C is programmed for a real computer. For com- when an empty space is requested, R is decremented pleteness, we optimize the running time of assembly until one is found. We assume that the following initial- language versions of coalesced hashing. izations have been made before any searches or inser- We choose to program in assembly language rather tions are performed: M ~ [tiM'], for some constant than in some high-level language like Fortran, PL/I, or 0 < fl _< 1; EMPTY[i] ,,-- true, for all 0 _< i _< M'; and Pascal, in order to achieve maximum possible efficiency. R ~ - - M ' + 1. Top efficiency is important in large-scale applications of C1. [Hash.] Set i ~-- hash(K). (Now 1 _< i _< M.) hashing, but it can also be achieved in smaller systems C2. [Is there a chain?] If EMPTY[i], then go to step C6. with little extra effort, because hashing algorithms are so (Otherwise, the ith slot is occupied, so we will look short that implementing them (even in assembly lan- at the chain of records that starts there.) guage) is easy. We use a hypothetical language based on C3. [Compare.] I f K = KEY[i], the algorithm terminates Knuth's Mix [6] because its features are similar to most successfully. well-known machines and its inherent simplicity allows C4. [Advance to next record.] If LINK[i] ~ O, then set us to write programs in clear and concise form. i ~ LINK[i] and go back to step C3. Program C below is a Mix-like implementation of C5. [Find empty slot.] (The search for K in the chain Algorithm C. Liberties have been taken with the lan- was unsuccessful, so we will try to find an empty guage for purposes of clarity; the actual Mxx code appears table slot to store K.) Decrease R one or more times in [10]. The program is written in a five-column format: until EMPTY[R] becomes true. I f R = 0, then there the first column gives the line numbers, the second are no more empty slots, and the algorithm termi- column lists the instruction labels, the third column nates with overflow. Otherwise, append the Rth cell contains the assembly language instructions, the fourth to the chain by setting LINK[i] ~-- R; then set i column counts the number of times the instructions are R. executed, and the last column is for comments that C6. [Insert new record.] Set EMPTY[i] <--false, KEY[i] explain what the instructions do. The syntax of the K, LINK[i] ~-- O, and initialize the other fields in commands should be clear to those familiar with assem- the record. • bly language programming. The four memory registers 913 Communications December 1982 of Volume 25 the ACM N u m b e r 12
  • 4. used in Program C are named rA, rX, rI, and rJ. The field: empty slots are marked by a - 1 in the L I N K field reference KEY(I) denotes the contents of the m e m o r y of that slot. Null links are denoted by a 0 in the L I N K location whose address is the value of K E Y plus the field. The variable R and the key K are stored in memory contents of rI. (This is KEY[i] in the notation of Algo- locations R and K. Registers rI and rA are used to store rithm C.) the values of i and K. Register rJ stores either the value Program C (Coalesced hashing search and insertion). of LINK[i] or R. The instruction labels SUCCESS and This program follows the conventions of Algorithm C, O V E R F L O W are for exiting and are assumed to lie except that the E M P T Y field is implicit in the L I N K somewhere outside this code. I 01 S T A R T LD X, K 1 Step C1. Load rX with K. 02 ENT A, 0 1 Enter 0 into rA. 03 DIV =M= 1 rA ~ [K/M], rX ~-- K mod M. 04 ENT I, X 1 Enter rX into rI. 05 INC I, 1 1 Increment rI by 1. 06 LD A, K 1 Load rA with K. 07 LD J, L I N K ( I ) 1 Step C2. Load rJ with LINK[i]. 08 JN J, STEP6 1 J u m p to STEP6 if LINK[i] < O. 09 CMP A, KEY(l) A Step C3. C o m p a r e K with KEY[i]. 10 JE SUCCESS A Exit (successessfully) if K = KE Y[i]. 11 JZ J, STEP5 A - SI J u m p to STEP5 if LINK[i] = O. 12 STEP4 ENT I, J C - 1 Step C4. Enter rJ into rI. 13 CMP A, KEY(I) C - 1 Step C3. C o m p a r e K with KEY[i]. 14 JE SUCCESS C- 1 Exit (successessfully) if K = KEY[i]. 15 LD J, L I N K ( I ) C - 1 - $2 Load rJ with LINK[i]. 16 JNZ J, STEP4 C - 1 - $2 J u m p to STEP4 if LINK[i] ~ O. 17 STEP5 LD J, R A - S Step C5. Load rJ with R. 18 DEC J, 1 T Decrement R by 1. 19 LD X, L I N K ( J ) T Load rX with LINK[R]. 20 JNN X, .-2 T G o back two steps if LINK[R] >_ O. 21 JZ J, O V E R F L O W A - S Exit (with overflow) if R = 0. 22 ST J, L I N K ( I ) A - S Store R in LINK[i] 23 ENT I, J A - S Enter rJ into rI. 24 ST J, R A - S Update R in memory. 25 STEP6 ST 0, L I N K ( I ) 1- S Step C6. Store 0 in LINK[i]. 26 ST A, KEY(I) 1- S Store K i~ KEY[i]. • The execution time is measured in MIX units of time, The fourth column of Program C expresses the num- which we denote u. The n u m b e r of time units required ber of times each instruction is executed in terms of the by an instruction is equal to the number of m e m o r y quantities references (including the reference to the instruction C = n u m b e r of probes per search. itself). Hence, the LD, ST, and CMP instructions each A = 1 if the initial probe found an occupied slot, take two units of time, while ENT, INC, DEC, and the 0 otherwise. j u m p instructions require only one time unit. The divi- S = 1 if successful, 0 if unsuccessful. sion operation used to compute the hash address is an T = n u m b e r of slots probed while looking for an empty exception to this rule; it takes 14u to execute. space. The running time of a MIX program is the weighted sum We further decompose S into S 1 + $2, where S 1 = 1 if the search is successful on the first probe, and S1 = 0 # times '~// # time units '~ otherwise. By formula (1), the total running time of the the i n s t r u c t i o n ~ required by ~ (1) searching phase is each instruction is executed / t h e instruction] in the program (7C + 4A + 17 - 3S + 2 S l ) u (2) This is a somewhat simplistic model, since it does not and the insertion of a new record after an unsuccessful make use of cache or buffered m e m o r y for fast access of search (when S = 0) takes an additional (SA + 4 T + 4)u. frequently used data, and since it ignores any interven- The average running time is the expected value of (2), tion by the operating system. But it places all hashing assuming that all insertions and searches are random. algorithms on an equal footing and gives a good indi- The formula can be obtained by replacing the variables cation of relative merit. in Eq. (2) with their expected values. 914 Communications D e c e m b e r 1982 of V o l u m e 25 the ACM N u m b e r 12
  • 5. 4. Tuning fl to Obtain Optimum Performance 4.2 MIX Running Times Optimizing the MIX execution times could be tricky, The purpose of the analysis in [10, 11, 13] is to show in general, because the formulas might have local as well how the average-case performance of the coalesced hash- as global minima. Then when we set the derivatives ing method varies as a function of the address factor fl equal to 0 in order to find floPr, there might be several = M / M ' and the load factor a = N/M'. In this section, roots to the resulting equations. The crucial fact that lets for eachfixed value of a, we make use of those results in us apply the same optimization techniques we used above order to "tune" our choice of fl and speed up the search for the number of probes is that the formulas for the MIX times. Our two measures of performance are the expected running times are well-behaved, as shown in the Appen- number of probes per search and the average running dix. By that we mean that each formula is minimized at time of assembly language versions. In the latter case, a unique floPT, which occurs either at the endpoint a = we study a MIX implementation in detail, and then show Aft or at the unique point in the "a > Aft" region where how to apply what we learn to other assembly languages. the derivative w.r.t, fl is 0. Unfortunately, there is no single choice of fl that The optimization procedure is the same as before. yields best results: the optimum choice flOPWis a function The expected values of formulas (A4) and (A5), which of the load factor a and it is even different for unsuc- give the MIX running times for unsuccessful and success- cessful and successful searches. The section concludes ful searches, are functions of the three variables a, fl, and with practical tips on how to initialize ft. In particular, A. We substitute Eq. (3) into the expected running times we shall see that the choice fl = 0.86 works well in most in order to express fl in terms of A. For several different situations. load factors c~ and for each type of search, we find the value of A that minimizes the formula, and then we 4.1 Number of Probes Per Search retranslate this value via Eq. (3) to get flOPW.Figure 2(b) For each fixed value of a, we want to find the values graphs these optimum values flOPW as a function of a; flOPT that minimize the expected number of search probes spline interpolation was used to fill in the gaps. As in the in unsuccessful and successful searches. Formulas (A1) previous section, the formulas for the average unsuccess- and (A2) in the Appendix express the average number ful and successful search times yield different optimum of probes per search as a function of three variables: the address factors. For the successful search case, notice load factor c~ = N/M', the address factor fl = M/M', how closely flOPT agrees with the corresponding values and a new variable A = L/M, where L is the expected that minimize the expected number of probes. number of inserted records needed to make the cellar become full. The variables fl and A are related by the formula 1 Fig. 2. The values //OPT that optimize search performance for the e -~ + A = - (3) following three measures: (a) the expected number of probes per B search, (b) the expected running time of Program C, and (c) the expected assembly language running time for large keys. Formulas (A1) and (A2) each have two cases, "a _< Aft" and "a _> Aft," which have the following intuitive 1.o ~ 1.0 meanings: The condition a < Aft means that with high probability not enough records have been inserted to fill up the cellar, while the condition a > Aft means that enough records have been inserted to make the cellar almost surely full. The optimum address factor flOPW is always located Successful somewhere in the "a _> Aft" region, as shown in the Appendix. The rest of the optimization procedure is a ~.~ 0.9 0.9 straightforward application of differential calculus. First, we substitute Eq. (3) into the "a _> Aft" cases of the formulas for the expected number of probes per search in order to express them in terms of only the two ._E variables a and A. For each nonzero fixed value of a, the (b) ~ Uns....... ful ~ k" ~ formulas are convex w.r.t. A and have unique minima. We minimize them by setting their derivatives equal to 0. Numerical analysis techniques are used to solve the 0,8 resulting equations and to get the optimum values of A for several different values of a. Then we reapply Eq. (3) to express the optimum points in terms of ft. The results are graphed in Fig. 2(a), using spline interpolation to fill 0 0.1 0.2 0.3 0,4 0,5 0.6 0.7 0.8 0,9 1.0 in the gaps. ].oad]:actor,a 915 Communications December 1982 of Volume 25 the ACM Number 12
  • 6. 4.3 Applying the Results to Other Implementations One strategy is to pick fl = 0.782, which minimizes Our MIX analysis suggests two important principles the expected number of probes per unsuccessful search to be used in finding/?OPT for a particular implementa- as well as the average MIX unsuccessful search time when tion of coalesced hashing. First, the formulas for the the table is full (i.e., load factor a = l), as indicated in expected number of times each instruction in the pro- Fig. 2. This choice of/3 yields the best absolute bound gram is executed (which are expressed for Program C in on search performance, because when the table is full, terms of C, A, S, S 1, $2, and T) may have the two cases, search times are greatest and unsuccessful searches av- "a --< )~/3" and "a _> )~/3," but probably not more. erage slightly longer than successful ones. Regardless of Second, the same optimization process as above can the load factor, the expected number of probes per search be used to find /3OPT, because the formulas for the would be at most 1.79, and the average MIX searching running times should be well-behaved for the following time would be bounded by 33.52u. reason: The main difference between Program C and Another strategy is to pick some compromise address another implementation is likely to be the relative time factor that leads to good overall performance for a large it takes to process each key. (The keys are assumed to be range of load factors. A reasonable choice is/3 = 0.86; very small in the MIX version.) Thus, the unsuccessful then the unsuccessful searches are optimized (over all search time for another implementation might be ap- other values o f f l ) when the load factor is =0.68 (number proximately of probes) and ,~0.56 (MIX), and the successful search performance is optimized at load factors -~0.94 (number [(2x + 5)C + (2x + 2)A + ( - 2 x + 19)]u' (4) of probes) and -~0.95 (MIX). where u' is the standard unit of time on the other Figures 3 through 6 graph the expected search per- computer and x is how many times longer it takes to formance of coalesced hashing as a function of a for process a key (multiplied by u/u'). Successful search both types of searches (unsuccessful and successful) and times would be about for both measures of performance (number of probes and MiX running time). The C1 curve corresponds to [(2x + 5 ) C + 18 + 2 S 1 ] u ' (5) standard coalesced hashing (i.e., fl = l); the Co.86 line is Formulas (4) and (5) were calculated by increasing the our compromise choice fl = 0.86; and the dashed line execution times of the key-processing steps 9 and 13 in CoPx represents the best possible search performance Program C by a factor of x. (See formulas (A4) and (A5) that could be achieved by tuning (in which fl is optimized for the x = 1 case.) We ignore the extra time it takes to for each load factor). load the larger key and to compute the hash function, Notice that the value/3 = 0.86 yields near-optimum since that does not affect the optimization. search times once the table gets half-full, so this compro- The role of C in formula (4) is less prevalent than in mise offers a viable strategy. Of course, if some prior (A4) as x gets large: the ratio of the coefficients of C and knowledge about the types and frequencies of the A decreases from 7/4 in (A4) and approaches the limit searches were available, we could tailor our choice of/3 2/2 = 1 in formula (4). Even in this extreme case, to meet those specific needs. however, computer calculations show that the formula for the average running time is well-behaved. The values of/3OPT that minimize formula (4) when x is large are 5. Comparisons graphed in Fig. 2(c). For successful searches, however, the value of C more In this section, we compare the searching times of the strongly dominates the running times for larger values of coalesced hashing method with those from a represent- x, so the limiting values offloPw in Fig. 2(c) coincide with ative collection of hashing schemes: standard coalesced the ones that minimize the expected number of probes hashing (C1), separate chaining (S), separate chaining per search in Fig. 2(a). Figure 2(b) shows that the with ordered chains (SO), linear probing (L), and double approximation is close even for the case x = l, which is hashing (D). Implementations of the methods are given Program C. in [10]. These methods were chosen because they are the 4.4 How to Choose fl most well-known and since they each have implemen- It is important to remember that the address region tations similar to that of Algorithm C. Our comparisons size M = [tiM'] must be initialized when the hash table are based both on the expected number of probes per is empty and cannot change thereafter. Unfortunately, search as well as on the average MIX running time. the last two sections show that each different load factor Coalesced hashing performs better than the other a requires a different optimum address factor /3OPT; in methods. The differences are not so dramatic with the fact, the values of flOPW differ for unsuccessful and suc- MIX search times as with the number of probes per cessful searches. This means that optimizing the average search, due to the large overhead in computing the hash unsuccessful (or successful) search time for a certain load address. However, if the keys were larger and compari- factor a will lead to suboptimum performance when the sons took longer, the relative MIX savings would closely load factor is not equal to a. approximate the savings in number of probes. 916 Communications December 1982 of Volume 25 the ACM Number 12
  • 7. Fig. 3. The average number of probes per unsuccessful search, as M Fig. 4. The average number of probes per successful search, as M and and M' --~ ~, for coalesced hashing (C,, Co.86, COPT for fl = 1, 0.86, M' ---> o0, for coalesced hashing (C,, C0.~6, COPT for fl = 1, 0.86, floP'r), flovr), separate chaining (S), separate chaining with ordered chains separate chaining (S), separate chaining with ordered chains (SO), (SO), linear probing (L), and double hashing (D). linear probing (L), and double hashing (D). 25 L / l 2.5 2.0 Y 2.0 ~ 2.0 1.5 / / 1.5 ;. L5 / / ~ s ~ S, SO ~ so I.O / ~ 1.0 0 0.1 0,2 0.3 0.4 0.5 0.h 0.7 0.8 0.9 1.0 0 0. I 02 0.3 0.4 0,5 06 0.7 08 0.9 1,0 l.oadfactor, a or l.oadl,,ctor,a or 5.1 Standard Coalesced Hashing (C1) "tuned" coalesced hashing are identical. Figures 3 and Standard coalesced hashing is the special case of 4 show that the savings in number of probes per search coalesced hashing for which fl = 1 and there is no cellar. can be as much as 14 percent (unsuccessful) and 6 This is obviously the most realistic comparison that can percent (successful). In Figs. 5 and 6, the corresponding be made, because except for the initialization of the savings in MIX searching time is 6 percent (unsuccessful) address region size, standard coalesced hashing and and 2 percent (successful). Fig. 5. The average Mix execution time per unsuccessful search, as Fig. 6. The average Mix execution time per successful search, as M' ---> oo, for coalesced hashing (C,, C0.s6, CoPy for fl = 1, 0.86, flOPT), M' --> ~, for coalesced hashing (C,, C0.s6, CoPy for fl = 1, 0.86, floPT), separate chaining (S), separate chaining with ordered chains (SO), separate chaining (S), separate chaining with ordered chains (SO), linear probing (L), and double hashing (D). linear probing (L), and double hashing (D). 40 40 L~D 40 { Cl 35 35 35 30 ~o ~ ~ 30 .= --- x ~. 25 25 ~ J 20 20 20 0 0,1 0.2 0,3 0.4 0.5 0.6 0.7 0,8 0.9 1.0 0 0.I 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Load Factor, a I.oadFaca)r, a 917 Communications December 1982 of Volume 25 the A C M Number 12
  • 8. 5.2 Separate (or Direct) Chaining (S) cessful search time of Program SO is worse than Program The separate chaining method is given an unfair C's, and in real-life situations, the difference is likely to advantage in Figs. 3 and 4: the number of probes per be more apparent, because records that are inserted first search is graphed as a function of ~ = N / M rather than tend to be looked up more often and should be kept near a = N / M ' and does not take into account the number of the beginning of the chain, not rearranged. auxiliary slots used to store colliders. In order to make Method SO has the same storage limitations as the the comparison fair, we must adjust the load factor separate chaining scheme (i.e., the table usually over- accordingly. flows when N = M = 0.731M'), whereas coalesced Separate chaining implementations are designed of- hashing can obtain full storage utilization. ten to accommodate about N = M records; an average of M(1 - 1 / M ) M ~ M / e auxiliary slots are needed to 5.4 Linear Probing (L) and Double Hashing (D) store the colliders. The total table size is thus M ' = M When searching for a record with key K, the linear + M/e. Solving backwards for M, we get M = 0.731M'. probing method first checks location hash(K), and if In other words, we may consider separate chaining to be another record is already there, it steps cyclically through the special case of coalesced hashing for which /3 -~ the table, starting at location hash(K), until the record is 0.731, except that no more records can be inserted once found (successful search) or an empty slot is reached the cellar overflows. Hence, the adjusted load factor is (unsuccessful search). Insertions are done by placing the a = 0.731~, and overflow occurs when there are around record into the empty slot that terminated the unsuc- N = M = 0.73 I M ' inserted records. (This is a reasonable cessful search. Double hashing generalizes this by letting space/time compromise: if we make M smaller, then the cyclic step size be a function of K. more records can usually be stored before overflow We have to adjust the load factor in the opposite occurs, but the average search times blow up; if we direction when we compare Algorithm C with methods increase M to get better search times, then overflow L and D, because the latter do not require L I N K fields. occurs much sooner, and many slots are wasted.) For example, if we suppose that the L I N K field com- If we adjust the load factors in Figs. 3 and 4 in this prises ¼ of the total record size in a coalesced hashing way, Algorithm C generates better search statistics: the implementation, then the search statistics in Figs. 3 and expected number of probes per search for separate chain- 4 for Algorithm C with load factor a should be compared ing is -~ 1.37 (unsuccessful) and -~ 1.5 (successful) when against those for linear probing and double hashing with the load factor 6 is 1, while that for coalesced hashing is load factor (¼)a. In this case, the average number of 1.32 (unsuccessful) and -~ 1.44 (successful) when the probes per search is still better for coalesced hashing. load factor a =/3~ is equal to 0.731. However, the L I N K field is often much smaller than The graphs in Figs. 5 and 6 already reflect this load the rest of the record, and sometimes it can be included factor adjustment. In fact, the MIX implementation of in the table at virtually no extra cost. The Mix imple- separate chaining (Program S in [10]) is identical to mentation Program C in [10] assumes that the raix field Program C, except that /3 is initialized to 0.731 and can be squeezed into the record without need of extra overflow is signaled automatically when the cellar runs storage Space. Figures 5 and 6, therefore, require no load out of empty slots. Program C is slightly quicker in MIX factor adjustment. execution time than Program S, but more importantly, To balance matters, the M~X implementations of lin- the coalesced hashing implementation is more space ear probing and double hashing, which are given in [10] efficient: Program S usually overflows when a = 0.731, and [7], contain two code optimizations. First, since while Program C can always obtain full storage utiliza- L I N K fields are not used in methods L and D, we no tion a = 1. This confirms our intuition that coalesced longer need 0 to denote a null L I N K , and we can hashing can accomodate more records than the separate renumber the table slots from 0 to M ' - 1; the hash chaining method and still outperform separate chaining function now returns a value between 0 and M ' - 1. before that method overflows. This makes the hash address computation faster by lu, because the instruction INC I, 1 can be eliminated. 5.3 Separate Chaining with Ordered Chains (SO) Second, the empty slots are denoted by the value 0 in This method is a variation of separate chaining in order to make the comparisons in the inner loop as fast which the chains are kept ordered by key value. The as possible. This means that records are not allowed to expected number of probes per successful search does have a key value of 0. The final results are graphed in not change, but unsuccessful searches are slightly Figs. 5 and 6. Coalesced hashing clearly dominates when quicker, because only about half the chain needs to be the load factor is greater than 0.6. searched, on the average. Our remarks about adjusting the load factor in Figs. 3 and 4 also apply to method SO. But even after that is 6. Deletions done, the average number of probes per unsuccessful search as well as the expected MIX unsuccessful search It is often useful in hashing applications to be able to time is slightly better for this method than for coalesced delete records when they no longer logically belong to hashing. However, as Fig. 6 illustrates, the average suc- the set of objects being represented in the hash table. For 918 Communications December 1982 of Volume 25 the A C M N umbe r 12
  • 9. example, in an airlines reservations system, passenger TOOTIE rehashes to the hole in location 10, so TOOTIE records are often expunged soon after the flight has moves up to plug the hole, leaving a new hole in position taken place. 9. Next, DONNA collides with AUDREY during rehashing, One possible deletion strategy often used for linear so DONNA remains in slot 8 and is linked to AUDREY. probing and double hashing is to include a special one- Then MARK also collides with AUDREY; we leave MARK in bit D E L E T E D field in each record that says whether or position 7 and link it to DONNA, which was formerly at not the record has been deleted. The search algorithm the end of AUDREY'Shash chain. The record JEFF rehashes must be modified to treat each "deleted" table slot as if to the hole in slot 9, so we move it up to plug the hole, it were occupied by a null record, even though the entire and a new hole appears in position 6. Finally, DAVE record is still there. This is especially desirable when rehashes to position 9 and joins JEVF'S chain. there are pointers to the records from outside the table. Location 6 is the current hole position when the I f there are no such external pointers to worry about, deletion algorithm terminates, so we set EMPTY[6] ~-- the "deleted" table slots can be reused for later insertions: true and return it to the pool of empty slots. However, Whenever an empty slot is needed in step C5 of Algo- the value of R in Algorithm C is already 5, so step C5 rithm C, the record is inserted into the first "deleted" will never try to reuse location 6 when an empty slot is slot encountered during the unsuccessful search; if there needed. is no such slot, an empty slot is allocated in the usual We can solve this problem by using an available- way. However, a certain percentage of the "deleted" slots space list in step C5 rather than the variable R; the list probably will remain unused, thus preventing full storage must be doubly linked so that a slot can be removed utilization. Also, insertions and deletions over a pro- quickly from the list in step C6. The available-space list longed period would cause the expected search times to does not require any extra space per table slot, since we approximate those for a full table, regardless of the can use the K E Y and L I N K fields of the empty slots for n u m b e r of undeleted records, because the "deleted" the two pointer fields. (The K E Y field is much larger records make the searches longer. than the L I N K field in typical implementations.) For I f we are willing to spend a little extra time per clarity, we rename the two pointer fields N E X T and deletion, we can do without the D E L E T E D field by P R E V . Slot 0 in the table acts as the d u m m y start of the relocating some of the records that follow in the chain. available-space list, so NEXT[O] points to the first actual The basic idea is this: First, we find the record we want slot in the list and PREV[O] points to the last. Before to delete, mark its table slot empty, and set the L I N K any records are inserted into the table, the following field of its predecessor (if any) to the null value 0. Then extra initializations must be made: NEXT[O] <--- M ' we use Algorithm C to reinsert each record in the re- P R E V [ M ' ] ,--- 0; and N E X T [ i ] ~ i - 1 and P R E V [ i - mainder of the chain, but whenever an empty slot is 1] ~ i, for 1 _< i _< M'. We replace steps C5 and C6 by needed in step C5, we use the position that the record C5. [Find empty slot.] (The search for K in the chain already occupies. was unsuccessful, so we will try to find an empty This method can be illustrated by deleting AL from table slot to store K.) I f the table is already full (i.e., location l0 in Fig. 7(a); the end result is pictured in Fig. NEXT[O] = 0), the algorithm terminates with over- 7(b). The first step is to create a hole in position l0 where flow. Otherwise, set L I N K [ i ] *---NEXT[O] and i *-- AL was, and to set AUDREY'S L I N K field to 0. Then we NEXT[0]. process the remainder of the chain. The next record C6. [Insert new record.] Remove the ith slot from the Fig. 7. (a) Inserting the eight records; (b) Inserting all the records except AL. (a) (b) 1 AUDREY 1 AUDREY 2 2 8 3 4 4 I 5 .. DAVE 5 DAVE 6 JEFF 6 7 MARK -~ 7 MARK 8 DONNA -I 8 DONNA 9 ,TOOTlEAL ""~J-1 9 JEFF i0 I0 TOOTIE Ii A.L. Ii A.L. Keys: A.L. AUDREY AL TOOTIE DONNA MARK JEFF DAVE Hash Addresses: 11 1 1 10 1 1 9 9 919 Communications December 1982 of Volume 25 the ACM Number 12
  • 10. available-space list by setting PREV'[NEXT[i]] ~-- resulting table is better-than-random: the average search PREV[i] and N E X T [ P R E V [ i ] ] ~-- NEXT[i]. Then times after N random insertions and one deletion are set E M P T Y [ i ] ~-- false, KEY[i] ~ K, L I N K [ i ] ~-- sometimes better (and never worse) than they would be 0, and initialize the other fields in the record. with N - 1 random insertions alone. Whether or not this remains true after more than one deletion is an open The following deletion algorithm is analyzed in problem. detail in [10] and [14]. If this deletion algorithm is used when there is a Algorithm CD (Deletion with coalesced hashing). This cellar (i.e., fl < 1), we can modify it so that whenever a algorithm deletes the record with koy K from a coalesced hole appears in the cellar during the execution of Algo- hash table constructed by Algorithm C, with steps C5 rithm CD, the next noncellar record in the chain moves and C6 modified as above. up to plug the hole. Unfortunately, even with this mod- This algorithm preserves the important invariant that ification, the algorithm does not break up chains well K is stored at its hash address if and only if it is at the enough to preserve randomness. It seems possible that start of its chain. This makes searching for K's predeces- search performance may remain very good anyway. sor in the chain easy: if it exists, then it must come at or Analytic and empirical study is needed to determine just after position hash(K) in the chain. "how far from r a n d o m " the search times get after dele- C D I . [Search for K.] Set i ~ hash(K). If E M P T Y [ i ] , tions are performed. Two remarks should be made about implementing then K is not present in the table and the algorithm this modified deletion algorithm. In step CD6, the empty terminates. Otherwise, if K = KEY[i], then K is at slot should be returned to the start of the available-space the start of the chain, so go to step CD3. list when the slot is in the cellar; otherwise, it should be CD2. [Split chain in two.] (K is not at the start of its placed at the end. This has the effect of giving cellar slots chain.) Repeatedly set P R E D ~-- i and i *-- higher priority on the available-space list. Second, if a L I N K [ i ] until either i = 0 or K = KEY[i]. I f i = cellar slot is freed by a deletion and then reallocated 0, then K is not present in the table, and the during a later insertion, it is possible for chain to go in algorithm terminates. Else, set L I N K [ P R E D ] and out of the cellar more than once. Programmers 0. should no longer assume that a chain's cellar slots im- CD3. [Process remainder of chain.] (Variable i will walk mediately follow the start of the chain. through the successors of K in the chain.) Set H O L E ~ i, i ~ LINK[i], L I N K [ H O L E ] ~-- O. Do step CD4 zero or more times until i = 0. Then 7. Implementations and Variations go to step CD5. CD4. [Rehash record in ith slot.] Set j ~ hash(KEY[i]). Most important searching algorithms have several I f j = H O L E , we move up the record to plug the different implementations in order to handle a variety of hole by setting K E Y [ H O L E ] ~-- KEY[i] and applications; coalesced hashing is no exception. We have H O L E ~ i. Otherwise, we link the record to the already discussed some modifications in the last section end of its hash chain by doing the following: set in connection with deletion algorithms. In particular, we j .-- L I N K [ j ] zero or more times until L I N K [ j ] needed to use a doubly linked available-space list so that = 0; then set L I N K [ j ] *-- i. Set k *-- LINK[i], the empty slots could be added and removed quickly. LINK[i] ~ O, and i *-- k. Repeat step CD4 unless Thus, the cellar need not be contiguous. Another strategy i=0. to handle a noncontiguous cellar is to link all the table CDS. [Mark slot H O L E empty.] Set E M P T Y [ H O L E ] slots together initially and to replace "Decrease R " in true. Place H O L E at the start of the available- step C5 of Algorithm C with "Set R *-- L I N K [ R ] . " With space list by setting N E X T [ H O L E ] ~ NEXT[O], either modification, Algorithm C can simulate the sepa- PRE V [ H O L E ] ~-- O, P R E V[NEXT[O]] ~ H O L E , rate chaining method until the cellar empties; subsequent NEXT[O] ~ H O L E . • colliders can be stored in the address region as usual. Algorithm CD has the important property that it Hence, coalesced hashing can have the benefit of dy- preserves randomness for the special case of standard namic allocation as well as total storage utilization. coalesced hashing (when M = M ' ) , in that deleting a Another c o m m o n data structure is to store pointers record is in some sense like never having inserted it. The to the fields, rather than the fields themselves, in the "sense" is strong enough so that the formulas for the table slots. For example, if the records are large, we average search times are still valid after deletions are might want to store only the key and link values in each performed. Exactly what preserving randomness means slot, along with a pointer to where the rest of the record is explained in detail in [14]. is located. We expand upon this idea later in this section. We can speed up the rehashing phase in the latter If we are willing to do extra work during insertion half of step CD4 by linking the record into the chain and if the records are not pointed to from outside the immediately after its hash address rather than at the end table, we can modify the insertion algorithm to prevent of the chain. When this modified deletion algorithm is the chains from coalescing: W h e n a record R1 collides called on a random standard coalesced hash table, the during insertion with another record Rz that is not at the 920 Communications December 1982 of Volume 25 the ACM Number 12
  • 11. start of the chain, we store R, at its hash address and rithm (Algorithm C in Sec. 2) as the late-insertion relocate R2 to some other spot. (The LINK field of R2's method. predecessor must be updated.) The size of the records Early-insertion can be used even if we do not have a should not be very large or else the cost of rearrangement priori knowledge about the record's presence, in which might get prohibitive. There is an alternate strategy that case the entire chain must be searched in order to verify prevents coalescing and does not relocate records, but it that the record is not already stored in the table. We can requires an extra link field per slot and the searches are implement this form of early-insertion by making the slightly longer. One link field is used to chain together following two modifications to Algorithm C. First, we all the records with the same hash address. The other add the assignment "Set j ~-- i" at the end of step C2, so link field contains for slot i a pointer to the start of the that j stores the hash address hash(K). The second chain of records with hash address i. Much of the space modification replaces the last sentence of step C5 by for the link fields is wasted, and chains m a y start one "Otherwise, link the R t h cell into the chain immediately link away from their hash address. Resources could be after the hash addressj by setting LINK[R] ~--LINK[j], put to better use by using coalesced hashing. LINK[j] ~ R; then set i ~ R." This section is devoted to the more nonobvious im- Each chain of records formed using early-insertion plementations of coalesced hashing. First, we describe contains the same records as the corresponding chain alternate insertion strategies and then conclude with formed by late-insertion. Since the length of a random three applications to external searching on secondary unsuccessful search depends only on the number of storage devices. A scheme that allows the coalesced hash records in the chain between the hash address and the table to share m e m o r y with other data structures can be end of the chain, and since all the records are in the found in [ 12]. A generalization of coalesced hashing that address region when there is no cellar, it must be true uses nonuniform hash functions is described in [13]. that the average n u m b e r of probes per unsuccessful search is the same for the two methods if there is no 7.1 Early-Insertion and Varied-lnsertion Coalesced cellar. However, the order of the records within each Hashing chain m a y be different for early-insertion than for late- I f we know a priori that a record is not already insertion. When there is no cellar, the early-insertion present in the table, then it is not necessary in Algorithm algorithm causes the records to align themselves in the C to search to the end of the chain before the record is chains closer to their hash addresses, on the average, inserted: I f the hash address location is empty, the record than would be the case with late-insertion, so the ex- can be inserted there; otherwise, we can link the record pected successful search times are better. into the chain immediately after its hash address by A typical case is illustrated in Fig. 8. The record DAVE rerouting pointers. We call this the early-insertion method collides with A.L. at slot 5. In Fig. 8(a), which uses late- because the collider is linked "early" in the chain, rather insertion, DAVE is linked to the end of the chain contain- than at the end. We will refer to the unmodified algo- ing A.L., whereas if we use early-insertion as in Fig. 8(b), Fig. 8. Standard Coalesced Hashing, M = M ' = 11, N = 8. (a) Late-insertion; (b) Early-insertion. (a) (b) late-insertion early-insertion a d d r e s s s i z e = 11 a d d r e s s s i z e = 11 1 AUDREY 1 AUDREY 2 2 3 DONNA S DONNA 4 JEFF 4 JE~ 5 A.L. ~, 5 A.L. 6 6 7 7 8 ..DAVE DAVE 9 MARK ~ 9 MARK 10 TOOTIE 10 TOOTIE ii AL / . 11 AL -I" Keys: A.L. AUDREY AL TOOTIE DONNA MARK JEFF DAVE Hash Addresses: 5 1 5 10 3 11 4 5 ave. ]/probes per succ. search: ( a ) 1 3 / 8 ~ 1.63, ( b ) 1 2 / 6 = 1.5. 921 Communications D e c e m b e r 1982 of Volume 25 the A C M N u m b e r 12
  • 12. DAVE is linked into the chain at the point between A.L. identical to early-insertion. In the varied-insertion and AL. The average successful search time in Fig. 8(b) method, the early-insertion strategy is used except when is slightly better than in Fig. 8(a), because linking DAVE the cellar is full and the hash address of the inserted into the chain immediately after A.L. (rather than at the record is the start of a chain that has records in the end of the chain) reduces the search time for DAVE from cellar. In that case, the record is linked into the chain four probes to two and increases the search time for AL immediately after the last cellar slot in the chain. from two probes to three. The result is a net decrease of Figure 9(c) shows a typical hash table constructed one probe. using varied-insertion. The cellar is already full when One can show easily that this effect manifests itself the record DAVE is inserted. The hash address of DAVE is only on chains of length greater than 3, so there is little 1, which is at the start of a chain that has records in the improvement when the load factor a is small, since the cellar. Therefore, early-insertion is not used, and DAVE chains are usually short. Recent theoretical results show is instead linked into the chain immediately after AL, that the average number of probes per successful search which is the last record in the chain that is in the cellar. is 5 percent better with early-insertion than with late- The average n u m b e r of probes per search is better for insertion when there is no cellar and the table is full (i.e., varied-insertion than for both late-insertion and early- a = 1), but is only 0.5 percent better when a = 0.5 insertion. [1, 5]. A possible disadvantage of early-insertion is that The varied-insertion method incorporates the advan- earlier colliders tend to be shoved to the rear by later tages of early-insertion, but without any of the drawbacks ones, which m a y not be desirable in some practical described three paragraphs earlier. The records of a situations when the records inserted first tend to be chain that are in the cellar always come immediately accessed more often than those inserted later. Neverthe- after the start of the chain. The average n u m b e r of less, early-insertion is an improvement over late-insertion probes per search for varied-insertion is always less than when there is no cellar. or equal to that for late-insertion and early-insertion. When there is a cellar, preliminary studies indicate For unsuccessful searches, the expected n u m b e r of that search performance is probably worse with early- probes for varied-insertion and late-insertion are identi- insertion than with Algorithm C, because a chain's rec- cal. ords that are in the cellar now come at the end of the Research is currently underway to determine the chain, whereas with late-insertion they come immedi- average search times for the varied-insertion method, as ately after the start. In the example in Fig. 9(b), the well as to find the values of the o p t i m u m address factor insertion of JEFF causes both cellar records AL and TOOTIE flOVV. We expect that the initialization fi ~-- 0.86 will be to move one link further from their hash addresses. That preferred in most situations, as it is for late-insertion. does not happen with late-insertion in Fig. 9(b). The resulting search times for varied-insertion should be We shall now introduce a new variant, called varied- a slight improvement over late-insertion. insertion, that can be shown to be better than both the The idea of linking the inserted record into the chain late-insertion and early-insertion methods when there is immediately after its hash address has been incorporated a cellar. When there is no cellar, varied-insertion is into the first modification of Algorithm CD in the last Fig. 9. Coalesced Hashing, M ' = 11, M = 9, N = 8. (a) Late-insertion; (b) Early-insertion; and (c) Varied-insertion. (a) (b) (o) late-insertion early-insertion varie d-insertion a d d r e s s size = 9 a d d r e s s size = 9 a d d r e s s size = 9 I A.L. 1 A.L. • 1 A.L. 2 , t 2 2 3 AUDREY i : i : I 3 AUDREY -- ~, 3 AUDREY • 4 , [ 4 4 5 I 5 5 6 DAVE t 6 DAVE "-- 6 DAVE --~ 7 JEFF -11 7 JEFF "-- 7 JEFF 8 MARK ~ ~- 8 MARK " 8 MARK .I <---i I 9 DONNA l -- -J 9 DONNA 9 DONNA i • (10) TOOTIEAL 10) TOOTLE --. (10) TOOTIE --~ (11) I1) AL (ii) AL -.. Keys: A.L. AUDREY AL TOOTLE DONNA MARK JEFF DAVE H a s h Addresses: 1 3 1 1 3 1 8 1 ave. # probes per unsuec, search: ( a ) 1 8 / 9 = 2.0, ( b ) 2 4 / 9 ~ 2.67, ( c ) 1 8 / 9 = 2.0. ave. # probes per succ. search: ( a ) 2 1 / 8 ~ 2 . 6 3 , ( b ) 2 2 / 8 = 2.75, ( c ) 2 0 / 8 = 2.5. 922
  • 13. section. It is natural to ask whether the modified deletion Deletions can be done in one of several ways, anal- algorithm would preserve randomness for the modified ogous to the different methods discussed in the last insertion algorithms presented in this section. The answer section. In some cases, it is best merely to mark the is no, but it is possible that the deletion algorithm could record as "deleted," because there may be pointers to the make the table better-than-random, as discussed at the record from somewhere outside the hash table, and end of the last section. Finding good deletion algorithms reusing the space could cause problems. Besides, m a n y for early-insertion and varied-insertion as well as for large scale database systems undergo periodic reorgani- late-insertion is a challenging problem. zation during low-peak hours, in which the entire table (minus the deleted records) is reconstructed from scratch 7.2 Coalesced Hashing with Buckets [15]. This method has not been analyzed analytically, Hashing is used extensively in database applications but it seems to have great potential. and file systems, where the hash table is too large to fit entirely in main memory and must be stored on external 7.3 Hash Tables Within a Hash Table devices, like disks and drums. The hash table is sectioned When the record size R is small compared to the off into blocks (or pages), each block containing b rec- block size B, the resulting bucket size b ~ B/R is ords; transfers to and from main m e m o r y take place a relatively large. Sequential search through the blocks is block at a time. Searching time is dominated by the now too slow. (The block transfer rate no longer domi- block transfer rate; now the object is to minimize the nates search times.) Other methods should be used to expected number of block accesses per search. organize the records within blocks. Operating systems with a virtual memory environ- This is especially true with multiattribute indexing, in ment are designed to break up data structures into blocks which we can look up records via one of several different automatically, even though it appears to the programmer keys. For example, a large university database may allow that his data structures all reside in main memory. Linear a student's record to be accessed by specifying either his probing (see Sec. 5) is often the best hashing scheme to name, social security number, student I.D., or bank use in this environment, because successive probes occur account number. In this case, four hash tables are used. in contiguous locations and are apt to be in the same Instead of storing all the records in four different tables, block. Thus, one or two block accesses are usually suf- we let the four tables share a single copy of the records. ficient for lookup. Each hash table entry consists of only the key value, the We can do better if we know beforehand where the link field, and a pointer to the rest of the student record block divisions occur. We treat each block as a large (which is stored in some other block). Lookup now table slot or bucket that can store b records. Let M ' be requires one extra block access. Continuing our numer- the total number of buckets. The following modification ical example, the table record size reduces from R --- 400 of Algorithm C appears in [7]. bytes to about R = 12 bytes, since the key occupies To process a record with key K, we search for it in 7 bytes, and the two pointer fields presumably can be the chain of buckets, starting at bucket hash(K). After squeezed into the remaining 5 bytes. The bucket size b an unsuccessful search, we insert the record into the last is now about B / R ..~ 333. bucket in the chain if there is room, or else we store it in In such cases where b is rather large and searching some nonfull bucket and link that bucket to the end of within a bucket can get expensive, it pays to organize the chain. We can speed up this last part by maintaining each bucket as a hash table. The hash function must be a doubly linked circular list of nonfull buckets, with a modified to return a binary number at least [log M ' ] + "roving pointer" marking one of the buckets. Each time [log b] bits in length; the high-order bits of the hash we need another nonfull bucket to store a collider, we address specify one of the M ' buckets (or blocks), and insert the record into the bucket indicated by the roving the low-order bits specify one of the b record positions pointer, and then we reset the roving pointer to the next within that bucket. Coalesced hashing is a natural bucket on the list. This helps distribute the records method to use because the bucket size (in this example, evenly, because different chains will use different buckets b = 333) imposes a definite constraint on the number of (at least until we make one loop through the available- records that m a y be stored in a block, so it is reasonable bucket list). When the external device is a disk, block to try to optimize the amount of space devoted to the accesses are faster when they occur on the same cylinder, address region versus the amount of space devoted to the so we should keep a separate available-bucket list for cellar. each cylinder. Record size varies from application to application, 7.4 Dynamic Hashing but for purposes of illustration, we use the following So far we have not addressed the problem of what to parameters: the block size B is 4000 bytes; the total do when overflow occurs--when we want to insert more record size R is 400 bytes, of which the key comprises 7 records into a hash table that is already full. The c o m m o n bytes. The bucket size b is approximately B/R = 10. technique is to place the extra records into an auxiliary When the size of the bucket is that small, searching in storage pool and link them to the main table. Search each bucket can be done sequentially; there is no need performance remains tolerable as long as the number of for the record size to be fixed, as long as each record is insertions after overflow does not get too large. (Guibas preceded by its length (in bytes). [4] analyzes this for the special case of standard coalesced 923
  • 14. hashing.) Later during the off-hours when the system is rithms and the design of new strategies that hopefully not heavily used, a larger table is allocated and the will preserve randomness. The variant methods in Sec. records are reinserted into the new table. 7 also pose interesting theoretical and practical open prob- This strategy is not viable when database utilization lems. The search performance of varied-insertion coa- is relatively constant with time. Several similar methods, lesced hashing is slightly better than Algorithm C; re- known loosely as dynamic hashing, have been devised search is currently underway to analyze its performance that allow the table size to grow and shrink dynamically and to determine the optimum setting flopt. One excit- with little overhead [3, 8, 9]. When the load factor gets ing aspect of coalesced hashing is that it is an extreme- too high or when buckets overflow, the hash table grows ly good technique which very likely can be made even larger and certain buckets are split, thereby reducing the more applicable when these open questions are solved. congestion. If the bucket size is rather large, for example, if we allow multiattribute accessing, then coalesced hash- ing can be used to organize the records within a block, Appendix as explained above, thus combining this technique with coalesced hashing in a truly dynamic way. For purposes of average-case analysis, we assume that an unsuccessful search can begin at any of the M address region slots with equal probability. This includes 8. Conclusions the special case of insertion. Similarly, each record in the table has the same chance of being the object of any Coalesced hashing is a conceptually elegant and ex- given successful search. In other words, all searches and tremely fast method for information storage and re- insertions involve random keys. This is sometimes called trieval. This paper has examined in detail several prac- the Bernoulli probability model tical issues concerning the implementation of the The asymptotic formulas in this section apply to a method. The analysis and programming techniques pre- random M'-slot coalesced hash table with address region sented here should allow the reader to determine whether size M = [tiM'] and with N -- raM'] inserted records, coalesced hashing is the method of choice in any given where the load factor a and the address factor fl are situation, and if so, to implement an efficient version of constants in the ranges 0 _< a <- l and 0 < fl _ I. Formal the algorithm. derivations are given in [10, I l, 13]. The most important issue addressed in this paper is the initialization of the address factor ft. The intricate Number of Probes Per Search optimization process discussed in Sec. 4 and the Appen- The expected number of probes in unsuccessful and dix can in principle be applied to any implementation of successful searches, respectively, as M ' ~ oo is coalesced hashing. Fortunately, there is no need to un- dertake such a computational burden for each applica- + e -~/B if a <-- Xfl tion, because the results presented in this paper apply to most reasonable implementations. The initialization fl 1 1 2 C'~(M', M) - ~ + g (~(o/B-~) _ l) 3 - ~ + 2X 0.86 is recommended in most cases, because it gives near-optimum search performance for a wide range of load factors. The graph in Fig. 2 makes it possible to fine-tune the choice of fl, in case some prior knowledge about the types and frequencies of the searches is avail- able. f°I +-- la 2B l(fl 2 •) ( ) ifa>__)~fi (AI) ifa--<~,fl The comparisons in Sec. 5 show that the tuned coalesced hashing algorithm outperforms several popular IB hashing methods when the load factor is greater then 0.6. I+--. 8a The differences are more pronounced for large records. The inner search loop in Algorithm C is very short and simple, which is important for practical implementations. CN(M', M) Coalesced hashing has the advantage over other chaining methods that it uses only one link field per slot and can (3 achieve full storage utilization. The method is especially suited for applications with a constrained amount of memory or with the requirement that the records cannot +~ +X ), +~X be relocated after they are inserted. In applications where deletions are necessary, one of the strategies described in Sec. 6 should work well in practice. However, research remains to be done in several where X is the umque nonnegative solution to the equa- areas including the analysis of the current deletion algo- tion 924 Communications December 1982 of Volume 25 the ACM Number 12