3. Basic Concepts
In a hashed search, the key, through an algorithmic function, determines the location of
the data.
We use a hashing algorithm to transform the key into the index that contains the data we
need to locate.
Another way to describe hashing is as a key-to-address transformation in which the keys
map to addresses in a list.
Hashing is a key-to-address mapping process.
7. The address produced by the hashing algorithm is known as the home address.
We call the set of keys that hash to the same location in our list synonyms.
A collision occurs when a hashing algorithm produces an address for an
insertion key and that address is already occupied.
The memory that contains all of the home addresses is known as the prime
area.
Each calculation of an address and test for success is known as a probe.
9. Hashing Methods
There are eight hashing methods:
Direct method
Subtraction method
Modulo-division
Midsquare
Digit extraction
Rotation
Folding
Pseudorandom generation
11. Direct Method:
In direct hashing the key is the address without any algorithmic manipulation.
Direct hashing is limited, but it can be very powerful because it guarantees
that there are no synonyms and therefore no collisions.
13. Subtraction Method
Sometimes keys are consecutive but do not start from 1.
Example:
A company may have only 100 employees, but the employee numbers start from
1001 and go to 1100.
In this case we use subtraction hashing, a very simple hashing function that
subtracts 1000 from the key to determine the address.
The direct and subtraction hash functions both guarantee a search effort of one
with no collisions.
They are one-to-one hashing methods: only one key hashes to each address.
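As a minimal sketch of the two one-to-one methods above, the following Python functions map the employee numbers 1001 to 1100 onto addresses 1 to 100. The function names and the `base` parameter are illustrative assumptions, not part of the source.

```python
def direct_hash(key: int) -> int:
    """Direct hashing: the key itself is the address."""
    return key


def subtraction_hash(key: int, base: int = 1000) -> int:
    """Subtraction hashing: subtract a fixed base from the key."""
    return key - base


# Employee numbers 1001..1100 map one-to-one onto addresses 1..100.
addresses = [subtraction_hash(k) for k in range(1001, 1101)]
print(addresses[0], addresses[-1])  # 1 100
```

Because every key yields a distinct address, no two keys can be synonyms.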
14. Modulo-Division Method:
Also known as division remainder, the modulo-division method divides the key by
the array size and uses the remainder for the address.
This method gives us the simple hashing algorithm shown below in which listSize is
the number of elements in the array:
Address = key MODULO listSize
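The formula above translates directly into code. This sketch assumes a list size of 307, borrowed from the worked examples elsewhere in this document, and uses three of the document's employee numbers as keys.

```python
def modulo_division_hash(key: int, list_size: int) -> int:
    """Address = key MODULO listSize."""
    return key % list_size


for key in (121267, 45128, 379452):
    print(key, modulo_division_hash(key, 307))
# 121267 2
# 45128 306
# 379452 0
```

The resulting addresses always fall in the range 0 to listSize - 1.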
17. Digit-Extraction Method:
Using digit extraction, selected digits are extracted from the key and used as the address.
Example:
Using our six-digit employee number to hash to a three-digit address (000-999),
we could select the first, third, and fourth digits (from the left) and use them as the
address.
379452 -> 394
121267 -> 112
378845 -> 388
160252 -> 102
045128 -> 051
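The examples above can be reproduced with a short sketch. The function name, the zero-based `positions` tuple, and the padding to six digits are assumptions made for illustration.

```python
def digit_extraction_hash(key: int, positions=(0, 2, 3), width: int = 6) -> int:
    """Build the address from selected digits of the zero-padded key.

    positions are zero-based from the left, so (0, 2, 3) selects the
    first, third, and fourth digits.
    """
    digits = str(key).zfill(width)
    return int("".join(digits[p] for p in positions))


for key in (379452, 121267, 378845, 160252, 45128):
    print(f"{key:06d} -> {digit_extraction_hash(key):03d}")
```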
18. Mid Square Method
In mid square hashing the key is squared and the address is selected from the
middle of the square number.
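One limitation is the size of the key: the square of a large key may be too big to store,
so only part of the key may be squared, as in the three-digit examples below.
Example:
9452² = 89340304: address is 3403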
379452: 379 * 379 = 143641 -> 364
121267: 121 * 121 = 014641 -> 464
378845: 378 * 378 = 142884 -> 288
160252: 160 * 160 = 025600 -> 560
045128: 045 * 045 = 002025 -> 202
The same digits must be selected from the product.
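A minimal sketch of the midsquare examples above. It assumes, as those examples do, that the first three digits of a six-digit key are squared, that the product is treated as a six-digit number padded with leading zeros, and that the third to fifth digits form the address. The helper names are illustrative.

```python
def middle_digits(value: int, width: int, start: int, count: int) -> int:
    """Return `count` digits of the zero-padded value, starting at `start`."""
    digits = str(value).zfill(width)
    return int(digits[start:start + count])


def midsquare_hash(key: int) -> int:
    """Square the first three digits of a six-digit key and take
    the same three middle digits of the product every time."""
    prefix = int(str(key).zfill(6)[:3])
    return middle_digits(prefix * prefix, width=6, start=2, count=3)


print(middle_digits(9452 * 9452, width=8, start=2, count=4))  # 3403
print(midsquare_hash(379452))  # 364
```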
19. Folding Method
Two folding methods are used:
Fold shift
Fold boundary
Fold Shift
In fold shift the key value is divided into parts whose size matches the size of the
required address.
Then the left and right parts are shifted and added with the middle part.
Fold boundary
In fold boundary the left and right numbers are folded on a fixed boundary between them
and the center number.
The two outside values are thus reversed.
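The two folding methods can be sketched as follows. The key 123456789 and the three-digit address size are hypothetical values chosen for illustration, and discarding any carry digit so that the sum fits the address size is an assumption the source does not state.

```python
def split_parts(key: int, size: int):
    """Divide the key's digits into parts of the required address size."""
    digits = str(key)
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def fold_shift(key: int, size: int = 3) -> int:
    """Add the parts as they are, then keep the low-order `size` digits."""
    total = sum(int(part) for part in split_parts(key, size))
    return total % 10 ** size


def fold_boundary(key: int, size: int = 3) -> int:
    """Reverse the two outside parts before adding them to the middle."""
    parts = split_parts(key, size)
    parts[0] = parts[0][::-1]
    parts[-1] = parts[-1][::-1]
    total = sum(int(part) for part in parts)
    return total % 10 ** size


print(fold_shift(123456789))     # 123 + 456 + 789 = 1368 -> 368
print(fold_boundary(123456789))  # 321 + 456 + 987 = 1764 -> 764
```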
21. Rotation Method
Rotation method is generally not used by itself but rather is incorporated in
combination with other hashing methods.
It is most useful when keys are assigned serially.
A simple hashing algorithm tends to create synonyms when hashing keys are
identical except for the last character.
Rotating the last character to the front of the key minimizes this effect.
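A small sketch of the rotation step, using hypothetical serial keys that differ only in their last character. The function name is illustrative; in practice the rotated key would then be passed to another hashing method.

```python
def rotate_key(key: str) -> str:
    """Move the last character of the key to the front."""
    if len(key) < 2:
        return key
    return key[-1] + key[:-1]


serial_keys = ["600101", "600102", "600103"]
rotated = [rotate_key(k) for k in serial_keys]
print(rotated)  # ['160010', '260010', '360010']
```

After rotation the keys differ in their leading character, which spreads them out when a method such as folding is applied next.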
23. Pseudorandom method
A common random-number generator is shown below.
y= ax + c
To use the pseudorandom-number generator as a hashing method, we set x to the
key, multiply it by the coefficient a, and then add the constant c.
The result is then divided by the list size, with the remainder being the hashed
address.
Example:
Y = ((17 * 121267) + 7) modulo 307
Y = (2061539 + 7) modulo 307
Y = 2061546 modulo 307
Y = 41
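The worked example above can be checked with a short sketch. The coefficient 17, the constant 7, and the list size 307 come from the example; the function name is an illustrative assumption.

```python
def pseudorandom_hash(key: int, list_size: int = 307, a: int = 17, c: int = 7) -> int:
    """y = ax + c with x set to the key, reduced modulo the list size."""
    return (a * key + c) % list_size


print(pseudorandom_hash(121267))  # 41
```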
24. Hashing algorithm
The hashing methods above may work well when we hash a key to an address in an array, but
hashing to large files is generally more complex.
Suppose we have an alphanumeric key consisting of up to 30 bytes that we need to hash into a
32-bit address.
Step 1: Convert the alphanumeric key into a number by adding the American
Standard Code for Information Interchange (ASCII) value of each character to an
accumulator that will become the address.
Step 2: As each character is added, we rotate the bits in the address to maximize the
distribution of the values.
Step 3: After the characters in the key have been completely hashed, we take the
absolute value of the address and then map it into the address range for the file.
26. Analysis
First:
The rotation can often be accomplished by an assembly language instruction.
If the algorithm is written in a high-level language, then the rotation is accomplished by
a series of bitwise operations.
For our purposes, it is sufficient that the 12 bits at the end of the address are shifted to
be the 12 bits at the beginning of the address and the bits at the beginning are shifted
to occupy the bit locations at the right.
Second:
This algorithm actually uses three of the hashing methods.
We use fold shift when we add the individual characters to the accumulator and rotation
when we rotate the bits.
Finally, we use modulo division when we map the hashed address into the range of
available addresses.
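The three steps and the 12-bit rotation described above might be sketched as follows. This is an illustration under stated assumptions, not the source's own code: the names `rotate_right_32` and `hash_key` are hypothetical, and the accumulator is kept as an unsigned 32-bit value, so the absolute-value step has nothing to do here.

```python
MASK32 = 0xFFFFFFFF  # keep the accumulator within 32 bits


def rotate_right_32(value: int, bits: int = 12) -> int:
    """Move the low-order `bits` bits to the high-order end of a 32-bit word."""
    value &= MASK32
    return ((value >> bits) | (value << (32 - bits))) & MASK32


def hash_key(key: str, max_addr: int) -> int:
    """Hash an alphanumeric key into the range 0..max_addr - 1."""
    addr = 0
    for ch in key:
        addr = (addr + ord(ch)) & MASK32  # Step 1: add the character code
        addr = rotate_right_32(addr, 12)  # Step 2: rotate the bits
    # Step 3: addr is already non-negative; map it into the address range.
    return addr % max_addr


print(hex(rotate_right_32(0xABC)))  # 0xabc00000
print(hash_key("Bryan Devaux", 307))
```

In a language with signed 32-bit integers the rotation can set the sign bit, which is why the original description takes the absolute value before the modulo step.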
27. Collision Resolution
With the exception of the direct and subtraction methods, none of the methods
used for hashing is a one-to-one mapping.
Thus, when we hash a new key to an address, we may create a collision.
A collision occurs when a hashing algorithm produces an address for an insertion
key and that address is already occupied.
There are several methods for handling collisions, each of them independent of
the hashing algorithm.
29. Concepts
The load factor of a hashed list is the number of elements in the list divided
by the number of physical elements allocated for the list, expressed as a
percentage.
Traditionally, load factor is assigned the symbol alpha (α).
The formula, in which k represents the number of filled elements in the list and
n represents the total number of elements allocated to the list, is
α = (k / n) × 100
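As a quick illustration of the formula, the sketch below computes the load factor for a hypothetical list with 75 of 100 allocated elements filled. The function name and the sample numbers are assumptions.

```python
def load_factor(filled: int, allocated: int) -> float:
    """Load factor as a percentage: (k / n) * 100."""
    return filled / allocated * 100


print(load_factor(75, 100))  # 75.0
```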
30. Computer scientists have identified two distinct types of clusters.
(i) Primary clustering occurs when data cluster around a home address.
Primary clustering is easy to identify.
(ii) Secondary clustering occurs when data become grouped along a collision
path throughout a list. This type of clustering is not easy to identify.
There are two different approaches to resolving collisions:
Open addressing
Linked lists.
31. Open Addressing
The first collision resolution method, open addressing, resolves collisions in the
prime area, that is, the area that contains all of the home addresses.
When a collision occurs, the prime area addresses are searched for an open or
unoccupied element where the new data can be placed.
32. Linear Probe
In a linear probe, which is the simplest method, when data cannot be stored in the home
address we resolve the collision by adding 1 to the current address.
If that address is also filled, we add another 1 to the address, repeating until we
find an empty location.
Advantages:
First: linear probes are quite simple to implement.
Second: data tend to remain near their home address.
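A minimal sketch of linear probing on insertion. It assumes modulo-division hashing with a list size of 307 and wraps around to the start of the list when the end is reached; the wraparound and the function name are assumptions. The three keys are employee numbers from this document that all hash to address 1.

```python
def linear_probe_insert(table: list, key: int) -> int:
    """Insert the key, adding 1 to the address after each collision."""
    size = len(table)
    home = key % size  # home address (modulo-division hashing assumed)
    for probe in range(size):
        addr = (home + probe) % size  # wrap around at the end of the list
        if table[addr] is None:
            table[addr] = key
            return addr
    raise OverflowError("list is full")


table = [None] * 307
placed = [linear_probe_insert(table, k) for k in (70918, 166702, 572556)]
print(placed)  # [1, 2, 3]
```

The output shows both advantages and the drawback: the code is short and the data stay close to home, but the synonyms pile up next to each other.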
34. Quadratic Probe
Primary clustering, although not necessarily secondary clustering, can be
eliminated by adding a value other than 1 to the current address.
One easily implemented method is to use the quadratic probe, in which the increment
is the collision probe number squared.
Disadvantage:
The time required to square the probe number.
We can eliminate the multiply factor, however, by using an increment factor that
increases by 2 each probe.
Adding the increment factor to the previous increment gives us the next
increment.
The quadratic probe has one limitation:
It is not possible to generate a new address for every element in the list.
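The multiplication-free increment can be sketched as below. The increments 1, 4, 9, ... are produced by adding an increment factor that starts at 3 and grows by 2 each probe. Adding each increment to the current address, the home address of 1, and the list size of 307 are assumptions for illustration.

```python
def quadratic_increments(count: int):
    """Yield 1, 4, 9, ... without multiplying."""
    increment, factor = 1, 3
    for _ in range(count):
        yield increment
        increment += factor  # next square = this square + odd factor
        factor += 2


def quadratic_probe_addresses(home: int, list_size: int, count: int):
    """Yield successive probe addresses, each adding the next increment."""
    addr = home
    for increment in quadratic_increments(count):
        addr = (addr + increment) % list_size
        yield addr


print(list(quadratic_increments(5)))               # [1, 4, 9, 16, 25]
print(list(quadratic_probe_addresses(1, 307, 5)))  # [2, 6, 15, 31, 56]
```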
36. Pseudorandom Collision Resolution
The last two open addressing methods, pseudorandom collision resolution and key offset, are
collectively known as double hashing.
In each method, rather than using an arithmetic probe function, the address is rehashed.
Pseudorandom collision resolution uses a pseudorandom number to resolve the collision.
We saw the pseudorandom-number generator as a hashing method; we now use it as a collision
resolution method. In this case, rather than use the key as a
factor in the random-number calculation, we use the collision address.
We now resolve the collision using the following pseudorandom-number generator, where
a is 3 and c is 5, for a collision at address 1:
Y = (ax + c) modulo listSize
  = (3 * 1 + 5) modulo 307
  = 8
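A short sketch of this rehash, with a = 3 and c = 5 as in the example. The function name is hypothetical, and the list size is assumed to be 307 as in the other examples in this document.

```python
def pseudorandom_probe(collision_addr: int, list_size: int = 307,
                       a: int = 3, c: int = 5) -> int:
    """Rehash the collision address: (a * address + c) modulo list size."""
    return (a * collision_addr + c) % list_size


first = pseudorandom_probe(1)
second = pseudorandom_probe(first)
print(first, second)  # 8 29
```

Every key that collides at the same address follows the same path, because only the collision address enters the calculation.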
37. Key Offset
Key offset is a double hashing method that produces different collision paths
for different keys.
Whereas the pseudorandom-number generator produces a new address as a
function of the previous address only, key offset calculates the new
address as a function of the old address and the key.
offset = ⌊key / listSize⌋
address = ((offset + old address) modulo listSize)
Example:
When the key is 166702 and the list size is 307, the modulo-division
hashing method generates a home address of 1.
offset = ⌊166702 / 307⌋ = 543
address = ((543 + 001) modulo 307) = 237
38. Key Offset
If 237 were also a collision, we would repeat the process to locate the next
address:
offset = ⌊166702 / 307⌋ = 543
address = ((543 + 237) modulo 307) = 166
Key      Home address   Key offset   Probe 1   Probe 2
166702   1              543          237       166
572556   1              1865         024       047
067234   1              219          220       132
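The key offset calculation can be reproduced with a brief sketch. The function name is an illustrative assumption, and integer division stands in for the bracketed quotient.

```python
def key_offset_probe(key: int, old_addr: int, list_size: int = 307) -> int:
    """New address as a function of both the key and the old address."""
    offset = key // list_size
    return (offset + old_addr) % list_size


for key in (166702, 572556, 67234):
    home = key % 307
    probe1 = key_offset_probe(key, home)
    probe2 = key_offset_probe(key, probe1)
    print(f"{key:06d} {home} {key // 307} {probe1:03d} {probe2:03d}")
```

Although all three keys share the home address 1, their different offsets send them along different collision paths.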
39. Linked list Collision Resolution
A major disadvantage of open addressing is that each collision
resolution increases the probability of future collisions.
This disadvantage is eliminated by the linked list approach.
A linked list is an ordered collection of data in which each element
contains the location of the next element.
40. Linked List Collision Resolution
[Figure: linked list collision resolution. The prime area spans addresses [000] to [306]; the records shown are 379452 Marry Dodd, 070918 Sarah Trapp, 121267 Bryan Devaux, 378845 Patrick Linn, 160252 Tuan Ngo, 045128 Feldman, 166702 Harry Eagle, and 572556 Chris Walljasper.]
41. Linked list Collision Resolution
The linked list approach uses a separate area to store collisions and chains all
synonyms together in a linked list.
It uses two storage areas: the prime area and the overflow area.
Each element in the prime area contains an additional field, a link
head pointer to a linked list of overflow data in the overflow
area.
When a collision occurs, one element is stored in the prime area and the
other is chained to the corresponding linked list in the overflow area.
The overflow area is typically implemented as a linked list in
dynamic memory.
42. Linked list Collision Resolution
The linked list data can be stored in any order, but a LIFO sequence or a key
sequence is most common.
The LIFO sequence is fastest for inserts because the linked list need
not be scanned to insert data.
The element being inserted into overflow is placed at the beginning of the
linked list and linked to the node in the prime area.
In key-sequenced lists, the key in the prime area is the smallest, which provides
for faster search retrieval.
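A minimal sketch of a prime area with LIFO overflow chains. The class names, the list-based slot layout, and the use of modulo-division hashing with 307 addresses are assumptions for illustration. The three records are synonyms from this document that hash to address 1.

```python
class OverflowNode:
    """One record in the overflow area, linked to the next synonym."""

    def __init__(self, key, data, next_node=None):
        self.key = key
        self.data = data
        self.next = next_node


class ChainedList:
    def __init__(self, size):
        # Each prime-area slot is None or a [key, data, head] triple,
        # where head points to the slot's overflow chain.
        self.prime = [None] * size

    def insert(self, key, data):
        addr = key % len(self.prime)  # modulo-division hashing (assumed)
        slot = self.prime[addr]
        if slot is None:
            self.prime[addr] = [key, data, None]
        else:
            # LIFO: the new node becomes the head; no scan is needed.
            slot[2] = OverflowNode(key, data, slot[2])
        return addr

    def search(self, key):
        slot = self.prime[key % len(self.prime)]
        if slot is None:
            return None
        if slot[0] == key:
            return slot[1]
        node = slot[2]
        while node is not None:
            if node.key == key:
                return node.data
            node = node.next
        return None


table = ChainedList(307)
for key, name in [(70918, "Sarah Trapp"), (166702, "Harry Eagle"),
                  (572556, "Chris Walljasper")]:
    table.insert(key, name)
print(table.search(166702))  # Harry Eagle
```

The first record stays in the prime area; later synonyms go to the front of the overflow chain, so a collision at address 1 never consumes another home address.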
43. Bucket Hashing
Keys are hashed to bucket nodes that accommodate multiple
data occurrences.
Because a bucket can hold multiple data items, collisions are postponed until
the bucket is full.
Example:
Each address is large enough to hold data for three employees.
A collision will not occur until we try to add a fourth employee to
an address.
Two problems:
First: it uses more space, because many of the buckets are empty or partially
empty at any time.
Second: it does not completely resolve the collision problem.
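The three-employee bucket from the example can be sketched as follows. The function name, the 307 buckets, and the fourth key (308, a hypothetical key that also hashes to bucket 1) are assumptions for illustration.

```python
BUCKET_CAPACITY = 3  # each address holds data for three employees


def bucket_insert(buckets: list, key: int, capacity: int = BUCKET_CAPACITY) -> bool:
    """Add the key to its bucket; return False when the bucket is full."""
    bucket = buckets[key % len(buckets)]  # modulo-division hashing (assumed)
    if len(bucket) >= capacity:
        return False  # a collision that must be resolved some other way
    bucket.append(key)
    return True


buckets = [[] for _ in range(307)]
results = [bucket_insert(buckets, k) for k in (70918, 166702, 572556, 308)]
print(results)  # [True, True, True, False]
```

The fourth insertion fails, which shows that buckets postpone collisions rather than eliminate them.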
44. Bucket Hashing
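[Figure: bucket hashing. Each address ([000], [001], [002], ...) is a bucket (Bucket 0, Bucket 1, Bucket 2, ... Bucket 307); the records shown are 379452 Marry Dodd, 070918 Sarah Trapp, 166702 Harry Eagle, 367173 Ann Giorgis, 121267 Bryan Devaux, 572556 Chris Walljasper, and 045128 Feldman.]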
45. Combination Approaches
There are several approaches to resolving collisions.
As we saw with the hashing methods, a complex implementation often uses
multiple steps.
Example:
One large database implementation hashes to a bucket.
If the bucket is full, it uses a set number of linear probes, such as three, to
resolve the collision and then uses a linked list overflow area.