1. For More Visit: Https://www.ThesisScientist.com
Unit 9
Hash Table
Hash Tables
The memory available to maintain the symbol table is assumed to be sequential. This memory is
referred to as the hash table, HT. The term bucket denotes a unit of storage that can store one or
more records. A bucket is typically one disk block size but could be chosen to be smaller or larger
than a disk block.
If the number of buckets in a hash table HT is b, then the buckets are designated HT(0), ..., HT(b-1). Each
bucket is capable of holding one or more records. The number of records a bucket can store is known as its
slot-size. Thus, a bucket is said to consist of s slots if it can hold s records.
A function that is used to compute the address of a record in the hash table is known as a hash function.
Usually s = 1, and in this case each bucket can hold exactly one record.
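The bucket and slot terminology can be made concrete with a small sketch. It assumes integer keys and uses the remainder modulo b as a placeholder hash function; the class and method names are illustrative only.

```python
class HashTable:
    """Hash table HT with b buckets, each holding at most s records."""

    def __init__(self, b, s=1):
        self.b = b                                # number of buckets
        self.s = s                                # slots per bucket
        self.buckets = [[] for _ in range(b)]     # HT(0) ... HT(b-1)

    def address(self, key):
        # Placeholder hash function (an assumption): remainder modulo b.
        return key % self.b

    def insert(self, key):
        """Store key in its bucket; return False if the bucket is full."""
        bucket = self.buckets[self.address(key)]
        if len(bucket) >= self.s:
            return False                          # overflow, not handled here
        bucket.append(key)
        return True


ht = HashTable(b=4, s=2)
# 5, 9 and 13 all map to HT(1); the third insert finds both slots taken.
print([ht.insert(k) for k in (5, 9, 13)])
print(ht.buckets)
```

With s = 2 the first two keys fit in HT(1) and the third is rejected, which is the overflow situation discussed later in this unit.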
Hashing Function
A hashing function f transforms an identifier X into a bucket address in the hash table. The address
so computed is known as the hash address of the identifier X. If two or more records have the same hash
address, they are said to collide. This phenomenon is called address collision.
The desired properties of a hashing function are that it should be easily computable and that it should
minimize the number of collisions.
A Uniform Hash Function is a hashing function for which the probability that f(X) = i is 1/b for every
bucket i, b being the number of buckets in the hash table. In other words, each bucket has an equal
probability of being assigned a record that is being inserted.
An ideal hash function distributes the stored keys uniformly across all the buckets so that every bucket has
the same number of records. Therefore, it is desirable to choose a hash function that assigns search key
values to buckets such that the following holds:
The distribution of key-values is uniform, that is, each bucket is assigned the same number of
search key values from the set of all possible search key values.
The distribution is random, that is, in the average case, each bucket will have nearly the same
number of values assigned to it, regardless of the actual distribution of search key values.
Several kinds of uniform hash functions are in use. We shall describe a few of them:
Mid Square hash function
The middle of square function, fm, is computed by squaring the identifier and then using an appropriate
number of bits from the middle of the square to obtain the bucket address.
Since the middle bits of the square will usually depend upon all of the characters in the identifier, it is
expected that different identifiers would result in different hash addresses with high probability even when
some of the characters in the identifiers are the same.
The number of bits to be used to obtain the bucket address depends on the table size. If r bits are used to
compute the hash address, the range of values is 2^r, so the size of the hash table is chosen to be a power
of 2 when this kind of scheme is used. Conversely, if the size of the hash table is 2^r, then the number of
bits to be selected from the middle of the square will be r.
Mid-square hash address(X) = the r middle bits of X^2
Example: Let the hash table size be 8 buckets with s = 1, let X be an identifier from a set of identifiers,
and let Y be the unique numerical value identifying X. The mid-square hash function is computed
as follows:
Hash table size = 8 = 2^3, so r = 3

X     Y    Y^2   Binary(Y^2)   Mid-Sq(Y^2)
A1    1     1    00 000 01     000 (0)
A7    7    49    01 100 01     100 (4)
A8    8    64    10 000 00     000 (0)
A2    2     4    00 001 00     001 (1)
A6    6    36    01 001 00     001 (1)
A5    5    25    00 110 01     110 (6)
A4    4    16    00 100 00     100 (4)
A3    3     9    00 010 01     010 (2)
We see that there is a hash collision (hash clash) for the keys A1 and A8, A7 and A4, and A2 and A6.
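The table can be reproduced with a short sketch. It assumes, as in the example, that the square is viewed as a 7-bit number and that the r = 3 middle bits are taken; the function name and the width parameter are illustrative.

```python
def mid_square(y, r, width):
    """Return the r middle bits of y*y, viewing the square as `width` bits."""
    square = y * y
    shift = (width - r) // 2                  # bits dropped on the right
    return (square >> shift) & ((1 << r) - 1)


for y in (1, 7, 8, 2, 6, 5, 4, 3):
    print(f"A{y}: Y^2 = {y * y:07b} -> bucket {mid_square(y, 3, 7)}")
```

The printed bucket numbers match the last column of the table, including the three collisions.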
Division hash function
Another simple choice for a hash function is obtained by using the modulo (mod) operator. The identifier X
is divided by some number M and the remainder is used as the hash address of X.
fD(X) = X mod M
This gives bucket addresses in the range 0 to M-1, so the hash table must have at least b = M buckets. M
should be a prime number such that M does not divide r^k ± a, where r is the radix of the character set and
k and a are very small numbers.
Example: Given a hash table with 10 buckets, what is the hash address for 'Cat'?
Suppose that converting the characters of 'Cat' to a number gives x = 131130.
We are given the table size m = 10, so the addresses run from 0 to 9.
f(x) = x mod m
f(131130) = 131130 mod 10
= 0
'Cat' is inserted into the table at address 0.
The Division method is distribution-independent.
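As a minimal sketch, the division method is a single remainder operation. The function name is illustrative, and the prime table size 11 in the second call is an arbitrary choice made only to show a prime divisor.

```python
def division_hash(x, m):
    """Return the bucket address x mod m, in the range 0 .. m-1."""
    return x % m


print(division_hash(131130, 10))   # 0, as in the 'Cat' example
print(division_hash(131130, 11))   # same key with a prime table size
```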
The Multiplication Method
It multiplies all the individual digits in the key together and takes the remainder after dividing the
resulting number by the table size.
f(x) = (a * b * c * d * ...) mod m
where m is the table size and a, b, c, d, etc. are the individual digits of the key.
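A short sketch of this formula follows, assuming the key is a non-negative integer written in decimal; the function name and the sample key 2347 are illustrative.

```python
from math import prod


def digit_product_hash(key, m):
    """Multiply the decimal digits of key together and reduce modulo m."""
    digits = [int(d) for d in str(key)]
    return prod(digits) % m


print(digit_product_hash(2347, 10))     # 2*3*4*7 = 168, and 168 mod 10 = 8
print(digit_product_hash(131130, 10))   # a 0 digit makes the product 0
```

Note that any key containing the digit 0 is sent to address 0 by this formula, so such keys all collide.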
Folding hash function
In this method identifier X is partitioned into several parts, all but the last being of the same length. These
parts are then added together to obtain the hash address for X. There are two ways to carry out this addition.
f(X) = (P1 + P2 + ... + Pn) mod (hash-size)
In the first, all but the last part are shifted so that the least significant bit of each part lines up with the
corresponding bit of the last part. The different parts are now added together to get f (x).
P1 = 123, P2 = 203, P3 = 241, P4 = 112, P5 = 20

P1    123
P2    203
P3    241
P4    112
P5     20
      ---
      699
Figure 9.1: Shift Folding
This method is known as shift folding. The other method of adding the parts is folding at the boundaries. In
this method the identifier is folded at the part boundaries and digits falling into the same position are added
together.
P1      123
P2^r    302
P3      241
P4^r    211
P5       20
        ---
        897
Figure 9.2: Folding at Boundaries (Pi^r = reverse of Pi)
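Both folding variants can be sketched in a few lines. The sketch assumes the identifier is the digit string formed by the five parts of the figures, split into parts of three digits, and it uses a hypothetical hash-size of 1000 so that the sums from the figures pass through unchanged.

```python
def split_parts(digits, width):
    """Split a digit string into parts of `width` digits; the last may be shorter."""
    return [digits[i:i + width] for i in range(0, len(digits), width)]


def shift_fold(parts, hash_size):
    """Add the parts as they are (shift folding)."""
    return sum(int(p) for p in parts) % hash_size


def boundary_fold(parts, hash_size):
    """Reverse every second part (P2, P4, ...) before adding."""
    total = 0
    for i, p in enumerate(parts):
        if i % 2 == 1:
            p = p[::-1]
        total += int(p)
    return total % hash_size


parts = split_parts("12320324111220", 3)
print(parts)                        # ['123', '203', '241', '112', '20']
print(shift_fold(parts, 1000))      # 699
print(boundary_fold(parts, 1000))   # 897
```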
Terms Associated with Hash Tables
Identifier Density
The ratio n/T is called the identifier density, where
n = number of identifiers
T = total number of possible identifiers.
The number of identifiers in use, n, is usually several orders of magnitude less than the total
number of possible identifiers, T.
The number of buckets, b, in the hash table is also much less than T.
Loading Factor
The loading factor is equal to n/(s*b), where
n = number of identifiers in use
s = number of slots in a bucket
b = total number of buckets
The number of buckets b is also much smaller than the total number of possible identifiers, T.
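The two ratios are easy to compute directly. In the sketch below the figures are hypothetical: identifiers are assumed to be six lowercase letters (so T = 26^6), with n = 1000 identifiers stored in b = 500 buckets of s = 4 slots.

```python
def identifier_density(n, T):
    """Fraction of all possible identifiers that are actually in use."""
    return n / T


def loading_factor(n, s, b):
    """Fraction of the s*b available slots that are occupied."""
    return n / (s * b)


T = 26 ** 6                          # assumed identifier space
print(identifier_density(1000, T))   # a very small number
print(loading_factor(1000, 4, 500))  # 0.5
```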
Synonyms
The hash function f almost always maps several different identifiers into the same bucket. Two identifiers
I1, I2 are said to be synonyms with respect to f if f(I1) = f(I2). Distinct synonyms are entered into the same
bucket so long as all the s slots in that bucket have not been used.
Collision
A collision is said to occur when two non-identical identifiers are hashed into the same bucket. When the
bucket size is 1, collision and overflow occur simultaneously.
Bucket Overflow
So far we have assumed that, when a record is inserted, the bucket to which it is mapped has available
space to store the record. If the bucket does not have enough space, a condition called bucket overflow
occurs.
Bucket overflow can occur for several reasons, for example:
Insufficient Buckets: The number of buckets, which we denote by nb, must be chosen such that nb > nr/fr,
where nr denotes the total number of records that will be stored and fr denotes the number of records that
will fit in a bucket. If this condition is not met, there will be fewer buckets than required, and
bucket overflow will result.
Handling of bucket Overflows
When an overflow occurs, it must be resolved by placing the records somewhere else in
the table, i.e. an alternative hash address must be generated for these entries. The resolution should aim at
reducing the chances of further bucket overflows.
Some of the approaches used for overflow resolution are described here:
Overflow Chaining or Closed Hashing
In this approach, whenever a bucket overflow occurs, a new bucket (called an overflow bucket) is attached to
the original bucket through a pointer. If the attached bucket is also full, another bucket is attached to it,
and so on. All the overflow buckets of a given bucket are chained together in a linked
list. Overflow handling using such a linked list is called Overflow Chaining.
As an example, let us take an array of pointers as the hash table (Figure 9.3).
Figure 9.3: A Chained Hash Table
Advantages of Chaining
1) Space Saving
Since the hash table is a contiguous array, enough space must be set aside at compilation time to
avoid overflow. On the other hand, if the hash table contains only pointers to the records, then
the size of the hash table may be reduced.
2) Collision Resolution
Chaining allows simple and efficient collision handling. To handle a collision, only a link field needs to
be added.
3) Overflow
It is no longer necessary that the size of the hash table exceed the number of records. If there are
more records than entries in the table, some of the linked lists simply
contain more than one record.
4) Deletion
Deletion proceeds in exactly the same way as deletion from a simple linked list. So in chained hash
table deletion becomes a quick and easy task.
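A chained hash table along these lines can be sketched as follows. The sketch assumes integer keys and a division hash, and it uses Python lists as stand-ins for the linked lists; the class and method names are illustrative.

```python
class ChainedHashTable:
    """Array of chains; each chain holds all keys that hash to that entry."""

    def __init__(self, size):
        self.table = [[] for _ in range(size)]

    def _index(self, key):
        return key % len(self.table)        # assumed hash function

    def insert(self, key):
        self.table[self._index(key)].append(key)

    def search(self, key):
        return key in self.table[self._index(key)]

    def delete(self, key):
        """Remove key from its chain; return True if it was present."""
        chain = self.table[self._index(key)]
        if key in chain:
            chain.remove(key)
            return True
        return False


t = ChainedHashTable(3)
for k in (3, 4, 5, 6, 7):                   # more records than table entries
    t.insert(k)
print(t.table)                              # [[3, 6], [4, 7], [5]]
print(t.delete(6), t.search(6))             # True False
```

Five records fit in a table of three entries because two of the chains hold more than one record, and deletion only touches the chain of the key being removed.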
Rehashing or Open Hashing
The form of hash structure that we have just described is sometimes referred to as closed hashing. Under an
alternative approach, called open hashing, the set of buckets is fixed and there are no overflow chains;
instead, if a bucket is full, records are inserted in some other bucket in the initial set of buckets B.
Rehashing techniques apply and, if necessary, re-apply some hash function
until an empty bucket is found.
Rehashing involves using a secondary hash function on the hash key of the item. The rehash function is
applied successively until an empty position is found where the item can be inserted. If the hash position of
the item is found to be occupied by a different item during a search, the rehash function is again used to
locate the item.
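The insert and search procedures just described can be sketched as below. The sketch assumes one record per bucket, integer keys, a division hash for the first address, and a simple rehash function that moves to the next position; all of these choices and names are illustrative.

```python
def linear_rehash(i, size):
    """Assumed rehash function: try the next position, wrapping around."""
    return (i + 1) % size


def insert(table, key, rehash=linear_rehash):
    """Place key in the first empty position on its rehash sequence."""
    i = key % len(table)
    for _ in range(len(table)):
        if table[i] is None:
            table[i] = key
            return i
        i = rehash(i, len(table))
    raise OverflowError("no empty position found")


def search(table, key, rehash=linear_rehash):
    """Follow the same sequence; stop at the key or at an empty position."""
    i = key % len(table)
    for _ in range(len(table)):
        if table[i] is None:
            return None
        if table[i] == key:
            return i
        i = rehash(i, len(table))
    return None


table = [None] * 7
print([insert(table, k) for k in (10, 17, 24)])   # [3, 4, 5]
print(search(table, 24), search(table, 31))       # 5 None
```

All three keys have first address 3, so the second and third are placed by rehashing, and a search for 24 retraces the same sequence.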
Double Hashing
This is another method of collision resolution, but unlike linear collision resolution, double hashing uses
a second hashing function, which normally limits multiple collisions. The idea is that if two values hash to
the same spot in the table, a constant can be calculated from the initial value using the second hashing
function, which can then be used to change the sequence of locations probed in the table while still giving
access to the entire table.
It consists of two rehashing functions f1 and f2. First, f1 is applied to get the location for insertion. If it
is occupied, then f2 is used to rehash. If there is again a collision, f1 is used for rehashing. In this way
the two functions are applied alternately until an empty location is found.
Example
Let us suppose we have to insert the key value 23763 in a table of size 10 using division-based hashing
functions. The two functions are f1(X) = (X + 1) mod tablesize and f2(X) = 2 + (X mod tablesize).
We apply the first function to compute the first hash index:
f1(23763) = (1 + 23763) mod 10 = 4. Let us suppose the 4th location is not free. Apply the rehashing function:
f2(f1(23763)) = f2(4) = 2 + (4 mod 10) = 6. If the 6th place is also not empty, continue:
f1(f2(f1(23763))) = f1(6) = (1 + 6) mod 10 = 7, and so on.
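The alternating use of f1 and f2 in this example can be traced with a small sketch. The two functions are taken from the example; the helper probe_sequence and its parameters are illustrative.

```python
def f1(x, size):
    return (x + 1) % size


def f2(x, size):
    return 2 + x % size


def probe_sequence(key, size, count):
    """Return the first `count` addresses, alternating f2 and f1 after f1."""
    addresses = []
    addr = f1(key, size)
    use_f2 = True
    for _ in range(count):
        addresses.append(addr)
        addr = f2(addr, size) if use_f2 else f1(addr, size)
        use_f2 = not use_f2
    return addresses


print(probe_sequence(23763, 10, 3))   # [4, 6, 7]
```

One caution about the functions as given: f2 returns 10 or 11 when its argument is 8 or 9, which lies outside a table of size 10, so a practical version would probably reduce the result of f2 modulo the table size as well.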
Key Dependent Increments
In key-dependent increments, the increment used for rehashing is computed by a function of the key itself;
it can even be a part of the key.
For example, we could truncate the key to a single character and use its code as the increment.
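A minimal sketch of this idea is given below. It assumes string keys, a home address obtained by summing character codes, and a prime table size so that any non-zero increment eventually reaches every position; the function name and these choices are illustrative.

```python
def probe_positions(key, size):
    """List the positions tried for key, stepping by a key-dependent increment."""
    home = sum(ord(c) for c in key) % size    # assumed home-address function
    step = ord(key[0]) % size or 1            # code of the first character
    return [(home + i * step) % size for i in range(size)]


print(probe_positions("Emu", 11)[:3])         # [9, 1, 4]
```

Here the increment for "Emu" is the code of 'E' (69) reduced modulo 11, which is 3. The fallback to 1 guards against an increment of 0, which would never leave the home address.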
Buckets:
Bucket 0: (empty)
Bucket 1: e-215
Bucket 2: e-101, e-110
Bucket 3: e-217, e-102
Bucket 4: e-218
Bucket 5: e-203
Bucket 6: (empty)

Records:
e-218   RISHABH   LUCKNOW
e-203   AJIT      JAIPUR
e-102   SAURABH   MUMBAI
e-215   ARUN      MADURAI
e-110   SHARAD    DELHI
e-101   CHANDRA   BANGLORE
e-217   GAURAV    CHENNI
Figure 9.4
ISAM
IBM’s Indexed Sequential Access Method (ISAM) is an index-based file access mechanism. Like hashing, it
locates a record from its key without scanning the whole file, but it does so through index files kept in
key order rather than through a hash function.
For large files, it is very difficult and costly to read the file from secondary storage into main memory,
and because secondary storage is slow, such access is very inefficient. Indexing makes file access more
efficient in both time and space. There are many implementations in different operating systems; IBM’s
technique is ISAM, a multilevel structure of index files.
It uses a small master index that points to disk blocks of a secondary index. The secondary index blocks point
to the actual file blocks. The file is kept sorted on a defined key. To find a particular item, we first make a
binary search of the master index, which provides the block number of the secondary index. This block is
read in, and again a binary search is used to find the block containing the desired record. Finally, this block
is searched sequentially. In this way, any record can be located from its key by at most two direct-access
reads.
Figure 20.5: ISAM implementation (master index, secondary index, data file)
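The lookup procedure described above can be sketched with in-memory lists standing in for disk blocks. The layout is an assumption made for illustration: each index is a sorted list of (first key, block number) pairs, each data block is a list of (key, record) pairs, and all names are hypothetical.

```python
from bisect import bisect_right


def find_block(index, key):
    """Binary-search a sorted index of (first_key, block_no) pairs."""
    first_keys = [k for k, _ in index]
    pos = bisect_right(first_keys, key) - 1
    return index[max(pos, 0)][1]


def isam_lookup(master, secondary, data, key):
    sec_block = secondary[find_block(master, key)]     # first direct-access read
    data_block = data[find_block(sec_block, key)]      # second direct-access read
    for k, record in data_block:                       # sequential search
        if k == key:
            return record
    return None


data = [[(1, "a"), (3, "b")], [(5, "c"), (8, "d")],
        [(10, "e"), (12, "f")], [(15, "g"), (20, "h")]]
secondary = [[(1, 0), (5, 1)], [(10, 2), (15, 3)]]
master = [(1, 0), (10, 1)]

print(isam_lookup(master, secondary, data, 8))    # d
print(isam_lookup(master, secondary, data, 4))    # None
```

The master index is assumed to stay in main memory, so only the secondary index block and the data block count as direct-access reads.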