Hash tables

Sunawar Khan
Islamia University of Bahawalpur
Bahawalnagar Campus
Analysis o f Algorithm

‫خان‬ ‫سنور‬ Algorithm Analysis
Collision
Hash Table
Direct Address Table
Hash Functions
Collision Resolution
Contents

Dictionary
• Dictionary:
• Dynamic-set data structure for storing items indexed using keys.
• Supports operations Insert, Search, and Delete.
• Applications:
• Symbol table of a compiler.
• Memory-management tables in operating systems.
• Large-scale distributed systems.
• Hash Tables:
• Effective way of implementing dictionaries.
• Generalization of ordinary arrays.

Direct-address Tables
• Direct-address Tables are ordinary arrays.
• Facilitate direct addressing.
• Element whose key is k is obtained by indexing into the kth position of the array.
• Applicable when we can afford to allocate an array with one position for
every possible key.
• i.e. when the universe of keys U is small.
• Dictionary operations can be implemented to take O(1) time.
• Details in Sec. 11.1.

Hash Tables
• Notation:
• U – Universe of all possible keys.
• K – Set of keys actually stored in the dictionary.
• |K| = n.
• When U is very large,
• Arrays are not practical.
• |K| << |U|.
• Use a table of size proportional to |K| – The hash tables.
• However, we lose the direct-addressing ability.
• Define functions that map keys to slots of the hash table.

Hashing
• A range of key values can be transformed into a range of array
index values.
• A simple array can be used where each record occupies one cell
of the array and the index number of the cell is the key value for
that record.
• But keys may not be well arranged.
• In such a situation hash tables can be used.

Hashing
• Hash function h: Mapping from U to the slots of a hash table T[0..m–1].
h : U  {0,1,…, m–1}
• With arrays, key k maps to slot A[k].
• With hash tables, key k maps or “hashes” to slot T[h[k]].
• h[k] is the hash value of key k.

Hashing
0
m–1
h(k1)
h(k4)
h(k2)=h(k5)
h(k3)
U
(universe of keys)
K
(actual
keys)
k1
k2
k3
k5
k4
collision

Issues with Hashing
• Multiple keys can hash to the same slot – collisions are possible.
• Design hash functions such that collisions are minimized.
• But avoiding collisions is impossible.
• Design collision-resolution techniques.
• Search will cost Ө(n) time in the worst case.
• However, all operations can be made to have an expected complexity of Ө(1).

What is a Hash Table ?
• The simplest kind of hash
table is an array of records.
• This example has 701
records.
[ 0 ] [ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ]
An array of records
. . .
[ 700]

• Each record has a special
field, called its key.
• In this example, the key is
a long integer field called
Number.
[ 0 ] [ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ]
. . .
[ 700]
[ 4 ]
Number 506643548

• The number might be a
person's identification
number, and the rest of the
record has information
about the person.
[ 0 ] [ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ]
. . .
[ 700]
[ 4 ]
Number 506643548

• When a hash table is in use,
some spots contain valid
records, and other spots are
"empty".
[ 0 ] [ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ] [ 700]
Number 506643548Number 233667136Number 281942902
Number 155778322
. . .

Inserting a New Record
• In order to insert a new
record, the key must
somehow be converted to an
array index.
• The index is called the hash
value of the key.
[ 0 ] [ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ] [ 700]
Number 155778322
. . .
Number 580625685

• Typical way create a hash
value:
[ 0 ] [ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ] [ 700]
Number 155778322
. . .
Number 580625685
(Number mod 701)
What is (580625685 mod 701) ?

• Typical way to create a hash
value:
[ 0 ] [ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ] [ 700]
Number 155778322
. . .
Number 580625685
(Number mod 701)
What is (580625685 mod 701) ?
3

• The hash value is used for
the location of the new
record.
Number 580625685
[ 0 ] [ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ] [ 700]
Number 155778322
. . .
[3]

Writing a hash function
• If we write a hash table that can store objects, we need a hash function
for the objects, so that we know what index to store them
• We want a hash function to:
1. Be simple/fast to compute
2. Map equal elements to the same index
3. Map different elements to different indexes
4. Have keys distributed evenly among indexes

Hash Function
• Need to compress the huge range of numbers.
• arrayIndex = hugenumber % smallRange;
• This is a hash function.
• It hashes a number in a large range into a number in a smaller range,
corresponding to the index numbers in an array.
• An array into which data is inserted using a hash function later is called a
hash table.

Hash Functions
• A good hash function is simple and can be computed quickly.
• Speed degrades if hash function is slow.
• Purpose is to transform a range of key values into index values such that
the key values are distributed randomly across all the indices of the hash
table.
• Keys may be completely random or not so random.

Hash Functions
Hash Functions
• Direct Hashing
• Subtraction Method
• Modulo Division
• Digit Extraction Method
• Mid-Square Method
• Folding Method
• Rotation Method
• Pseudo Random Methods

Direct Hashing
• No algorithm Manipulation
• No Collision
• Limited
• Not Suitable for large key Values
KEY H(K) Address

Modulo Division Method
• Map a key k into one of the m slots by taking the remainder of k divided by
m. That is,
h(k) = k mod m where m is the list size
• Example: m = 31 and k = 78  h(k) = 16.
• Advantage: Fast, since requires just one division operation.
• Disadvantage: Have to avoid certain values of m.
• Don’t pick certain values, such as m=2p
• Or hash won’t depend on all bits of k.
• Good choice for m:
• Primes, not too close to power of 2 (or 10) are good.

Mid Square Hashing
• Def:- In mid-square hashing the key is squared and the address is selected
from the middle of the square number
• Address = (Middle Digits of Key)2
• Middle of Square
• Key is squared and the address is selected from the middle of the square
number
• Problem:- Non- Uniform distribution of the key.
Size of the key is too large
Variation:- Use only a portion of the key

Example
•Key = 9452 * 9452 = 89340304 = 3403
•Variation Explanation:-
•379452 :- 379 * 379 = 143641 = 364

Digit Extraction
• Using digit Extraction selected digits are extracted from the key and used
as the address.
:- Address = Selected digits from the key
Example:-
Using SIX digit employee number to hash to a three digit address, we
could select the first, third and fourth digits(from the left) and use them as
the address.
• The keys are
• 379452 = 394
• 121267 = 112
• 378845 = 388
• 045128 = 051
051 045128
112 121267
388 378845
394 379452

Folding Method for Hashing
• Break the key into groups of digits and add the groups.
• The number of digits in a group should correspond to the size of the array.
Fold-Shift Hashing Fold-Boundary Hashing
Key value is divided into parts whose size
matches the size required address
Left and right number are folded on a
fixed boundary between them the center
no.
K = 123456789{3}
123 + 456 + 789
123
456
789
1368
K = 123456789{3}
123 + 456 + 789
321
456
987
1764
H(k) = 368 H(k) = 764

Pseudo-Random Method
Key
Pseudo Random
Number Generator
Random Number Modulo Division Address
Key = y = ax + b
For maximum efficacy a and c should be prime numbers
Where
• Y is random number
• a is chosen co-efficient
• c is constant
H(k) = Y MOD n
Where n is list size

Random Keys
• If the world were perfect, Evenly distributed NOT!
• A perfect hash function maps every key into a different table location.
• In most cases large number of keys are compressed into a smaller range of
index numbers.
• Distribution of key values in a particular database determines what the
hash function needs to be.
• For random keys: index = key % arraySize;

Example
•Example:-
Key = 121267 a = 17 c = 7 listSize = 307
Address = ((17 * 121267) + 7 MOD 307 + 1
= (2061539 + 7) MOD 307 + 1
= 41 + 1
= 42

Subtraction Hashing
• Subtract a fixed number from key
• H(k) = k – c
• Where c is fixed number
• Like Direct Hashing
• No Collision
• Suitable for Small List

Non-random Keys
• Consider a number of the form 030-050-20-92-7.
• Every digit serves a purpose. The last 3 digits are redundant for error
checking.
• These digits shouldn’t be considered.
• Every part of the remaining key should contribute to the data.
• Use a prime number for the modulo base.

0
1 21
2
3
4 34
5
6
7 7
8 18
9
• Collision: the event that two hash table elements map into
the same slot in the array
• Example: add 41, 34, 7, 18, then 21
• 21 hashes into the same slot as 41!
• 21 should not replace 41 in the hash table;
they should both be there
Collision Resolution: a strategy for fixing collisions in a hash
table
Hash collisions

Collisions
• Two words can hash to the same array index, resulting in collision.
•Open Addressing: Search the array in some systematic way for an
empty cell and insert the new item there if collision occurs.
•Separate chaining: Create an array of linked list of words, so that the
item can be inserted into the linked list if collision occurs.

Collision Resolution Methods
Collision
Resolution
Open
Addressing
Linear
Probing
Quadratic
Probing
Pseudo
Random
Methods
Key Offset
Linked List Buckets

Open Addressing
• Three methods to find next vacant cell:
• Linear Probing :- Search sequentially for vacant cells, incrementing the
index until an empty cell is found.
• Clustering is a problem occurring in linear probing.
• As the array gets full, clusters grow larger, resulting in very long probe
lengths.
• Array can be expanded if it becomes too full.

Open Addressing
• Open Addressing is
• a collision resolution strategy
• on a collision, look for another empty spot in the array
• previous discussed examples are all examples of open addressing
• Linear probing
• Quadratic probing
• Double hashing
• Look-up for open addressing scheme must continue looking for item until it finds it
or an empty slot.

0
1 41
2 21
3
4 34
5
6
7 7
8 18
9 57
• linear probing: resolving collisions in slot i by putting the
colliding element into the next available slot (i+1, i+2, ...)
• add 41, 34, 7, 18, then 21, then 57
• 21 collides (41 is already there), so we search ahead until we find
empty slot 2
• 57 collides (7 is already there), so we search ahead twice until we find
empty slot 9
• lookup algorithm becomes slightly modified; we have to loop
now until we find the element or an empty slot
• what happens when the table gets mostly full?
Linear probing

0 49
1 58
2 9
3
4
5
6
7
8 18
9 89
• Clustering: nodes being placed close together by probing,
which degrades hash table's performance
• add 89, 18, 49, 58, 9
• now searching for the value 28 will have to check half the hash
table! no longer constant time...
Clustering problem

Quadratic Probing
• load factor = nItems / arraySize;
• If load factor isn’t high, clusters can form.
• In quadratic probing more widely separated cells are probed.
• The step is the square of the step number.
• If index is x, the probe goes to x+1, x+4, x+9, x+16 and so on.
• Eliminates primary clustering, but all the keys that hash to a particular cell
follow the same sequence in trying to find a vacant cell (secondary
clustering).

0 49
1
2 58
3 9
4
5
6
7
8 18
9 89
• Quadratic probing: resolving collisions on slot i by putting
the colliding element into slot i+1, i+4, i+9, i+16, ...
• add 89, 18, 49, 58, 9
• 49 collides (89 is already there), so we search ahead by +1 to empty slot
0
• 58 collides (18 is already there), so we search ahead by +1 to occupied
slot 9, then +4 to empty slot 2
• 9 collides (89 is already there), so we search ahead by +1 to occupied
slot 0, then +4 to empty slot 3
• clustering is reduced
• what is the lookup algorithm?
Quadratic probing

Double Hashing
• Better solution.
• Generate probe sequences that depend on the key instead of being the same for every
key.
• Hash the key a second time using a different hash function and use the result as the
step size.
• Step size remains constant throughout a probe, but its different for different keys.
• Secondary hash function should not be the same as primary hash function.
• It must never output a zero.
• stepSize = constant – (key % constant);
• Requires that size of hash table is a prime number.

0
1
2
3 49
4
5
6
7
8 18
9 89
• Double hashing:
• Pick a secondary hash function hash2().
• when hashing item x, resolving collisions on slot i by putting the
colliding element into slot i+hash2(x), i+2*hash2(x),
i+3*hash2(x), i+4*hash2(x), ...
• Suppose hash2(x) = (x / 10) % 10.
• add 89, 18, 49, 58: What happens?
• 49 collides (89 is already there); hash2(x) = 4, so check location i + 4
next; put 49 in slot 3.
• 58 collides (18 is already there); hash2(x) = 5, so check location i + 5
next; still occupied, check location i+2*5; still occupied!
• will remain still occupied forever!
• Fix this particular problem by using a prime # for your table size. Then
will visit all array entries eventually during probing.
• what is the lookup algorithm?
Double Hashing

Separate Chaining
• No need to search for empty cells.
• The load factor can be 1 or greater.
• If there are more items on the lists access time is reduced.
• Deletion poses no problems.
• Table size is not a prime number.
• Arrays (buckets) can be used at each location in a hash table instead of a
linked list.

0
1
2
3
4
5
6
7
8
9
10
107
22 12 42
• Chaining: All keys that map to the same hash value are kept in a linked list
Chaining

Hashing Efficiency
• Quadratic probing and Double Hashing share their performance
equations.
• For successful hashing : -log2(1 - loadFactor) / loadFactor
• For an unsuccessful search :- 1 / (1 - loadFactor)
• Searching for separate chaining :- 1 + loadFactor /2
• For unsuccessful search :- 1 + loadFactor
• For insertion :- 1 +loadfactor ?2 for ordered lists and 1 for unordered lists.

Open Addressing vs. Separate Chaining
• If open addressing is to be used, double hashing is preferred over
quadratic probing.
• If plenty of memory is available and the data won’t expand, then linear
probing is simpler to implement.
• If number of items to be inserted in hash table isn’t known, separate
chaining is preferable to open addressing.
• When in doubt use separate chaining

External Storage
• Hash table can be stored in main memory.
• If it is too large it can be stored externally on disk, with only part of it
being read into main memory at a time.
• In external hashing its important that the blocks do not become full.
• Even with a good hash function, the block might become full.
• This situation can be handled using variations of the collision-resolution
schemes.

• main operation: lookup of item in table
• What is worst-case cost of finding an item?
• assuming hash table e hash table has n items in it
• Is the worst-case cost different for chaining, and the various open
addressing schemes?
• Worst-case analysis doesn’t make sense for hash tables, look at average
case cost
• Cost highly depend on the load factor
Analysis of hash tables

• Load: the load  of a hash table is the ratio:
 no. of elements
 array size
• Average case analysis of search
• Assume hashCode distributes entries uniformly at random into various indices.
• Using chaining implementation
• What is the average list size?
• What does this imply about search times?
M
N
Analysis of hash table search

‫خان‬ ‫سنور‬ Algorithm Analysis51
• Average case analysis of search, with chaining:
• Count number of link traversals necessary.
• unsuccessful: 
(the average length of a list at hash(i))
• successful: 1 + (/2)
(one node, plus half the avg. length of a list)
• Analysis of open addressing schemes:
• Are more lookups or less lookups required for open addressing, on average?

‫خان‬ ‫سنور‬ Algorithm Analysis52
• Average case analysis of search, with linear probing:
• Number of lookups worse than chaining
• Complicated to analyze; done by Knuth [1962]
• unsuccessful: 
• successful: 







 2
)1(
1
1
2
1









1
1
1
2
1

Hashing Efficiency
• Insertion and searching can approach O(1) time.
• If collision occurs, access time depends on the resulting probe lengths.
• Individual insert or search time is proportional to the length of the probe.
This is in addition to a constant time for hash function.
• Relationship between probe length (P) and load factor (L) for linear
probing : P = (1+1 / (1 – L2)) / 2 for successful search and
P = (1 + 1 / (1 – L))/ 2

Exam Related Questions
• 11.1-1
Suppose that a dynamic set S is represented by a direct-address table T
of length m. Describe a procedure that ﬁnds the maximum element of S.
What is the worst-case performance of your procedure?
• Answer
Starting from the first index in T, keep track of the highest index so far
that has anon NIL entry. This takes time O(m).

11.1-2
A bit vector is simply an array of bits(0sand 1s). A bit vector of length m takes much
less space than an array of m pointers. Describe how to use a bit vector to represent a
dynamic set of distinct elements with no satellite data. Dictionary operations should
run in O.1/ time.
Answer
Start with a bit vector b which contains a 1 in position k if k is in the dynamic set, and
a 0 otherwise. To search, we return true if b[x]==1. To insert x, set b[x]=1. To delete
x, set b[x]=0. Each of the set takes O(1)time.

11.2-2
Demonstrate what happens when we insert the keys 5, 28, 19, 15, 20, 33, 12, 17, 10
into a hash table with collisions resolved by chaining. Let the table have 9 slots, and
let the hash function be h(k) = k MOD 9.
Answer
Label the slots of our table be 0, 1, 2, . . . , 8. numbers
which appear to the left in the table have been inserted
later.

11.3-1
Suppose we wish to search a linked list of length n, where each element contains a
key k along with a hash value h(k). Each key is a long character string. How might we
take advantage of the hash values when searching the list for an element with a given
key?
Answer
If every element also contained a hash of the long character string, when we are
searching for the desired element, we’ll first check if the hash value of the node in the
linked list, and move on if it disagrees. This can increase the runtime by a factor
proportional to the length of the long character strings.

11.3-2
Suppose that we has a string of r characters into m slots by treating it as a radix-128
number and then using the division method. We can easily represent the number m
as a 32-bit computer word, but the string of r characters, treated as a radix-128
number, takes many words. How can we apply the division method to compute the
hash value of the character string without using more than a constant number of
words of storage outside the string itself?
Answer
Compute the value of the first character mod m, add the value of the second
character mod m, add the value of the third character mod m to that, and so on, until
all r characters have been taken care of.

11.4-2
Write pseudo code for HASH-DELETE as outlined in the text, and modify HASH-
INSERT to handle the special value DELETED.
Answer
i = HASH SEARCH(T;k)
T[i]= DELETED
We modify INSERT by changing line 4 to: if T[j]== NIL or T[j]== DELETED.
Algorithm1 HASH-DELETE(T,k)

Hash tables

More Related Content

What's hot

Similar to Hash tables

More from International Islamic University

Recently uploaded

Hash tables