3. WHAT IS HASHING?
Hashing is a technique (via a hash function) that maps large data sets of
variable length, called keys, to smaller data sets of a fixed length.
A hash table (or hash map) is a data structure that uses a hash function to
map keys to values for efficient search and retrieval.
Hash tables are widely used in many kinds of computer software, particularly for associative
arrays, database indexing, caches, and sets.
4. Different data structures to realize a dictionary
Binary Tree
Array, Linked List
AVL Tree
B-Tree
Hash Table
5. Hash Table
A hash table is a data structure that stores elements and allows insertions, lookups, and deletions
to be performed in O(1) expected time.
A hash table is an alternative method for representing a dictionary.
In a hash table, a hash function is used to map keys into positions in a table. This act is
called hashing.
Hash Table Operations
Search: compute f(k) and see if a pair exists at that position
Insert: compute f(k) and place the pair in that position
Delete: compute f(k) and delete the pair in that position
In the ideal situation, a hash table search, insert, or delete takes O(1) time.
6. How Does it Work?
The hash table part is just an ordinary array; it is the hash function that we are interested in.
The hash function transforms a key into an address or index of the array (table) where the record will
be stored. If the size of the table is N, then the integer will be in the range 0 to N − 1. The integer is used
as an index into the array. Thus, in essence, the key itself indexes the array.
If h is a hash function and k is a key, then h(k) is called the hash of the key and is the index at which a
record with the key k should be placed.
The hash function generates this address by performing some simple arithmetic or logical operations
on the key.
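As a concrete sketch of the idea above, here is a minimal C hash function; the table size of 10 and the plain modulo operation are illustrative assumptions, not fixed by the text:

```c
/* Illustrative sketch: map a non-negative integer key into the index
   range 0..TABLE_SIZE-1 using simple arithmetic on the key. */
#define TABLE_SIZE 10

int hash(int key)
{
    return key % TABLE_SIZE;   /* result is always in 0..TABLE_SIZE-1 */
}
```

For example, hash(12345) is 5, so a record with key 12345 would be stored at index 5 of the array.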
7. Why Hashing?
The sequential search algorithm takes time proportional to the data size, i.e., O(n).
Binary search improves on linear search, reducing the search time to
O(log n).
With a BST, an O(log n) search efficiency can be obtained; but the worst-case complexity is O(n).
To guarantee the O(log n) search time, BST height balancing is required (i.e., AVL trees).
8. Why Hashing?
Suppose that we want to store 10,000 students records (each with a 5-digit ID) in a given container.
A linked list implementation would take O(n) access time.
A height-balanced tree would give O(log n) access time.
Using an array of size 100,000 would give O(1) access time but would lead to a lot of space wastage.
Is there some way that we could get O(1) access without wasting a lot of space?
Yes, the answer is hashing.
9. What is Hash Function?
Suppose we have a hash table of size N.
Keys are used to identify the data.
A hash function is used to compute a hash value.
A hash value (hash code) is:
Computed from the key with the use of a hash function to get a number in the
range 0 to N − 1
Used as the index (address) of the table entry for the data
Regarded as the “home address” of a key
Desire: the addresses are different and spread evenly over the range.
When two keys have the same hash value, we have a collision.
10. Good Hash Functions
Fast to compute, O(1)
Scatters keys evenly throughout the hash table
Fewer collisions
Fewer slots (space) needed
The hash function uses all the input data.
The hash function generates very different hash values
for similar strings.
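As an illustration of the last two properties, here is a sketch of a djb2-style string hash; the constants 5381 and 33 are conventional choices for this style of hash, not taken from the text:

```c
/* Sketch of a string hash in the djb2 style. It folds in every input
   character, so similar strings such as "cat" and "cats" end up with
   very different hash values. */
unsigned long hash_string(const char *s, unsigned long table_size)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;   /* uses all the input data */
    return h % table_size;                  /* index in 0..table_size-1 */
}
```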
11. Perfect Hash Functions
A perfect hash function is a one-to-one mapping between keys and hash values, so no collision
occurs.
This is possible if all keys are known in advance.
Applications: a compiler or interpreter searching for reserved words; a shell interpreter searching
for built-in commands.
Minimal perfect hash function: the table size is the same as the number of keywords supplied.
12. What is Linear Probing?
In this section we will see what the linear probing technique in the open addressing scheme is.
There is an ordinary hash function h′(x) : U → {0, 1, . . ., m − 1}.
In the open addressing scheme, the actual hash function h(x) takes the ordinary hash function
h′(x) and attaches another part to it to make one linear equation: h(x, i) = (h′(x) + i) mod m.
Suppose we have a list of size 20 (m = 20). We want to put some elements in linear probing
fashion. The elements are {96, 48, 63, 29, 87, 77, 48, 65, 69, 94, 61}
14. Hash Table
In linear probing, we linearly probe for the next free slot. The
typical gap between two probes is 1, as taken in the example below
also.
15. Let us consider a simple hash function, “key mod 7”,
and the sequence of keys 50, 700, 76, 85, 92, 73, 101.
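The insertion rule can be sketched in C; the table size 7 and the “key mod 7” hash match the example above, and the sketch assumes the table still has a free slot:

```c
/* Linear probing insert with h(key) = key mod 7, mirroring the
   slide's example. Sketch only: assumes a free slot exists. */
#define LP_M     7
#define LP_EMPTY (-1)

void lp_insert(int table[LP_M], int key)
{
    int i = key % LP_M;
    while (table[i] != LP_EMPTY)
        i = (i + 1) % LP_M;   /* probe the next slot; the gap is 1 */
    table[i] = key;
}
```

Inserting 50, 700, 76, 85, 92, 73, 101 in order leaves the table as [700, 50, 85, 92, 73, 101, 76]: 85, 92, 73, and 101 each collide at their home slot and slide forward to the next free one.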
16. Challenges in Linear Probing
1. Primary Clustering: One of the problems with linear probing is
primary clustering: many consecutive elements form groups, and it
starts taking more time to find a free slot or to search for an element.
2. Secondary Clustering: Secondary clustering is less severe; two
records have the same collision chain (probe sequence) only
if their initial position is the same.
17. What is Double Hashing?
Double hashing is another technique in the open addressing scheme.
There is an ordinary hash function h′(x) : U → {0, 1, . . ., m − 1}.
In the open addressing scheme, the actual hash function h(x) starts from the ordinary hash
function h′(x); when that slot is not empty, a second hash function is applied to find a free
slot to insert into:
h1(x) = x mod m
h2(x) = x mod m′
h(x, i) = (h1(x) + i · h2(x)) mod m
The value of i = 0, 1, . . ., m − 1. So we start from i = 0 and increase it until we get a
free slot. Initially, when i = 0, h(x, i) is the same as h′(x).
18. What is Double Hashing?
Suppose we have a list of size 20 (m = 20).
We want to put some elements in double hashing fashion.
The elements are {96, 48, 63, 29, 87, 77, 48, 65, 69, 94, 61}.
h1(x) = x mod 20
h2(x) = x mod 13
h(x, i) = (h1(x) + i · h2(x)) mod 20
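A minimal C sketch of this insertion, using the h1 and h2 above. One caveat worth flagging: h2(x) is 0 when x is a multiple of 13, which would stall the probe if the home slot were taken; practical schemes choose h2 so it is never zero, but it does not bite in this example:

```c
/* Double hashing insert with h1(x) = x mod 20, h2(x) = x mod 13,
   h(x, i) = (h1(x) + i * h2(x)) mod 20. Sketch only: assumes a free
   slot is reachable, and that h2(x) != 0 whenever probing is needed. */
#define DH_M     20
#define DH_M2    13
#define DH_EMPTY (-1)

void dh_insert(int table[DH_M], int x)
{
    int i;
    for (i = 0; i < DH_M; i++) {
        int pos = (x % DH_M + i * (x % DH_M2)) % DH_M;
        if (table[pos] == DH_EMPTY) {
            table[pos] = x;    /* placed at the first free probe */
            return;
        }
    }
}
```

Inserting the elements in order, 96 lands at slot 16 and the first 48 at slot 8; the duplicate 48 probes 8, then 17 (taken by 77), then settles at slot 6.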
21. COMMON HASHING FUNCTIONS
Some common hashing algorithms include:
MD5 (Message Digest algorithm)
SHA-1 (Secure Hash Algorithm-1)
SHA-2 (Secure Hash Algorithm-2)
NTLM (NT LAN Manager)
LANMAN (LAN Manager)
22. COLLISION
● Since a hash function gets us a small number for a key which is a big integer or string,
there is a possibility that two keys result in the same value.
● The situation where a newly inserted key maps to an already occupied slot in the hash
table is called collision.
● Collisions must be handled for efficient implementation and performance of hash functions
and for us to perform the basic operations of searching, adding, and deleting.
23. Example
A typical example of collision is shown in the image below where keys map to the same hash
value after calculation by the hash function.
24. Collision resolution
There are mainly two methods to handle collision:
1) Separate Chaining: The idea is to make each cell of hash table point to a linked list of
records that have same hash function value.
2) Open Addressing: In open addressing, all elements are stored in the hash table itself. So
at any point, the size of the table must be greater than or equal to the total number of keys.
25. Collision Resolution by Chaining.
● In chaining, each location in a hash table stores a pointer to a linked list that
contains all the key values that were hashed to that location.
● That is, location l in the hash table points to the head of the linked list of all the
key values that hashed to l. However, if no key value hashes to l, then location l
in the hash table contains NULL.
● Figure below shows how the key values are mapped to a location in the hash
table and stored in a linked list that corresponds to that location.
27. Operations on a Chained Hash Table
• Searching for a value in a chained hash table is as simple as scanning a linked list for an
entry with the given key.
• The insertion operation adds the key to the linked list pointed to by the hashed
location (the code in slide 30 inserts at the head, keeping insertion O(1)).
• Deleting a key requires searching the list and removing the element.
• Chained hash tables with linked lists are widely used due to the simplicity of the algorithms
to insert, delete, and search a key.
28. Efficiency:
• The time complexity of inserting a key in a chained hash table is O(1).
• The cost of deleting and searching a value is given as O(m) where m is the number of
elements in the list of that location.
• Searching and deleting takes more time because these operations scan the entries of the
selected location for the desired key.
• In the worst case, searching a value may take a running time of O(n), where n is the
number of key values stored in the chained hash table.
• This case arises when all the key values are inserted into the linked list of the same
location (of the hash table).
29. Code to initialise chained hash table:
typedef struct node {
int value;
struct node *next;
} node;
void initialiseHashTable(node *hash_table[], int m)
{ int i;
for (i = 0; i < m; i++)
hash_table[i] = NULL;
}
Time complexity: O(m)
30. Code to insert a value
/* The element is inserted at the beginning of the linked list whose pointer to its head is
stored in the location given by h(k). The running time of the insert operation is O(1), as the
new key value is always added as the first element of the list .*/
node *insert_value(node *hash_table[], int val)
{ node *new_node;
new_node = (node *)malloc(sizeof(node));
new_node->value = val;
new_node->next = hash_table[h(val)];
hash_table[h(val)] = new_node;
return new_node;
}
31. Searching a value:
The element is searched in the linked list whose pointer to its head is stored in the location
given by h(k).
If search is successful, the function returns a pointer to the node in the linked list; otherwise
it returns NULL.
The worst case running time of the search operation is given as order of size of the linked
list.
32. Code to search a value
node *search_value(node *hash_table[], int val)
{
node *ptr;
ptr = hash_table[h(val)];
while ((ptr != NULL) && (ptr->value != val)) {
ptr = ptr->next;
}
return ptr; /* NULL if the value was not found */
}
33. Deleting a value:
● To delete a node from the linked list whose head is stored at the location given by h(k) in
the hash table, we need to know the address of the node’s predecessor.
● To do this, we keep a second pointer, save, that trails behind during the search.
● The running time complexity of the delete operation is the same as that of the search
operation, because we need to find the predecessor of the node so that the node can be
removed without affecting other nodes in the list.
34. Code to delete a value
void delete_value(node *hash_table[], int val)
{
node *save, *ptr;
save = NULL;
ptr = hash_table[h(val)];
while ((ptr != NULL) && (ptr->value != val))
{
save = ptr;
ptr = ptr->next;
}
if (ptr != NULL)
{
if (save == NULL) /* deleting the head node */
hash_table[h(val)] = ptr->next;
else
save->next = ptr->next;
free(ptr);
}
else
printf("\n VALUE NOT FOUND");
}
35. Advantages of chaining
• Simple to implement.
• Hash table never fills up, we can always add more elements to the chain.
• Less sensitive to the hash function or load factors.
• It is mostly used when it is unknown how many and how frequently keys may be inserted or
deleted.
36. Disadvantages of chaining
● Cache performance of chaining is not good as keys are stored using a linked list. Open
addressing provides better cache performance as everything is stored in the same table.
● Wastage of Space (Some Parts of hash table are never used).
● If the chain becomes long, then search time can become O(n) in the worst case.
● Uses extra space for links.
37. Open addressing technique:
• In Open Addressing, all elements are stored in the hash table itself. So at any point, the
size of the table must be greater than or equal to the total number of keys.
• Insert(k): Keep probing until an empty slot is found. Once an empty slot is found, insert k.
• Search(k): Keep probing until the slot’s key becomes equal to k or an empty slot is
reached.
• Delete(k): If we simply delete a key, then the search may fail. So slots of deleted keys are
marked specially as “deleted”.
• The insert can insert an item in a deleted slot, but the search doesn’t stop at a deleted
slot.
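The three operations above can be sketched in C with linear probing and a special “deleted” marker; the table size 11 and the marker values are illustrative assumptions:

```c
/* Open addressing sketch: linear probing plus a "deleted" marker so
   that searching does not stop at removed keys. Assumes non-negative
   keys and that the table never fills completely. */
#define OA_M       11
#define OA_EMPTY   (-1)
#define OA_DELETED (-2)

static int oa_hash(int k) { return k % OA_M; }

void oa_insert(int t[OA_M], int k)
{
    int i = oa_hash(k);
    while (t[i] != OA_EMPTY && t[i] != OA_DELETED)
        i = (i + 1) % OA_M;        /* probe until a usable slot */
    t[i] = k;                      /* deleted slots may be reused */
}

int oa_search(int t[OA_M], int k)  /* returns slot index, or -1 */
{
    int i = oa_hash(k), probes = 0;
    while (t[i] != OA_EMPTY && probes++ < OA_M) {
        if (t[i] == k)
            return i;              /* search skips over deleted slots */
        i = (i + 1) % OA_M;
    }
    return -1;
}

void oa_delete(int t[OA_M], int k)
{
    int i = oa_search(t, k);
    if (i >= 0)
        t[i] = OA_DELETED;         /* mark specially, don't just empty */
}
```

For instance, after inserting 22, 33, 44 (all hashing to slot 0) and deleting 33, a search for 44 still succeeds because it passes through the deleted slot instead of stopping there.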
38. Hash Buckets:
In computing, a hash table [hash map] is a data structure that provides virtually direct
access to objects based on a key [a unique String or Integer]. A hash table uses a hash
function to compute an index into an array of buckets or slots, from which the desired
value can be found. Here are the main features of the key used:
● The key used can be your SSN, your telephone number, account number, etc.
● Keys must be unique
● Each key is associated with (mapped to) a value
● Hash buckets are used to apportion data items for sorting or lookup purposes. The aim of this
work is to shorten the linked lists so that a specific item can be found within a
shorter time frame
40. Hash Buckets:
• In case a bucket is completely full, the record will get stored in an
overflow bucket of infinite capacity at the end of the table.
• All buckets share the same overflow bucket.
However, a good implementation will use a hash function that distributes
the records evenly among the buckets so that as few records as possible go
into the overflow bucket.
41. Bucket Hashing:
Closed hashing stores all records directly in the hash table. Each record R with key
value kR has a home position that is h(kR), the slot computed by the hash function.
If R is to be inserted and another record already occupies R's home position, then R
will be stored at some other slot in the table.
It is the business of the collision resolution policy to determine which slot that will be.
Naturally, the same policy must be followed during search as during insertion, so that
any record not found in its home position can be recovered by repeating the collision
resolution process.
42. Hash Bucket:
One implementation for closed hashing groups hash table slots into buckets. The M slots of the hash
table are divided into B buckets, with each bucket consisting of M/B slots. The hash function assigns
each record to the first slot within one of the buckets.
If this slot is already occupied, then the bucket slots are searched sequentially until an open slot is
found. If a bucket is entirely full, then the record is stored in an overflow bucket of infinite capacity at
the end of the table. All buckets share the same overflow bucket.
A good implementation will use a hash function that distributes the records evenly among the buckets
so that as few records as possible go into the overflow bucket.
When searching for a record, the first step is to hash the key to determine which bucket should contain
the record. The records in this bucket are then searched. If the desired key value is not found and the
bucket still has free slots, then the search is complete.
43. Hash Buckets:
If the bucket is full, then it is possible that the desired record is
stored in the overflow bucket.
In this case, the overflow bucket must be searched until the record
is found or all records in the overflow bucket have been checked. If
many records are in the overflow bucket, this will be an expensive
process.
44. Methods:
Bucket methods are good for implementing hash tables stored on disk, because
the bucket size can be set to the size of a disk block. Whenever search or
insertion occurs, the entire bucket is read into memory. Because the entire
bucket is then in memory, processing an insert or search operation requires only
one disk access, unless the bucket is full. If the bucket is full, then the overflow
bucket must be retrieved from disk as well. Naturally, overflow should be kept
small to minimize unnecessary disk accesses.
45. Collision Resolution
Bucket hashing is treating the hash table as a two dimensional array instead of a
linear array.
Consider a hash table with S slots that are divided into B buckets, with each
bucket consisting of S/B slots. The hash function assigns each record to the first
slot within one of the buckets. If that slot is already occupied, then the bucket
slots are searched sequentially until an empty slot is found. If the bucket is
completely full, the record will be stored in an overflow bucket of infinite
capacity at the end of the table, which is shared by all buckets. This makes
bucket hashing a form of closed hashing. An ideal implementation
will use a hash function that distributes the records evenly among all buckets so
there will be as few records as possible to store in the overflow bucket.
46. Collision Resolution
Given this bucket hash table for an array of size 10 storing 5
buckets, each bucket having two slots in size, let's demonstrate
how this method works in practice. We also have an overflow
bucket of infinite size on the right to store records when the
buckets in the main hash table are occupied. I will be using mod
operation as the hash function.
47. Collision Resolution
Let us start by inserting the number 18 as our first record. Since we
have 5 buckets, we take mod 5. 18 % 5 is 3. We put this into the top of
B3, which is slot 6 of the hash table.
Now inserting a record for 30. 30 % 5 is 0. 30 goes into B0[0].
Next we insert a record for 38; 38 % 5 is 3 so it will be placed in B3[1].
Next up we have 48. 48 % 5 is 3, but the B3 is already full, hence we
store 48 in the first available slot of our overflow bucket.
We can now try with 20. 20 % 5 is 0; B0[0] is occupied hence it will be
stored in B0[1].
Now if we insert 25, 25 % 5 is 0 and we know both slots of B0 are
occupied now, hence it will end up in our overflow bucket.
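The walkthrough above can be reproduced with a small C sketch: 5 buckets of 2 slots each, stored in a flat array of size 10 (bucket b occupies slots b*2 and b*2+1), plus a shared overflow array:

```c
/* Bucket hashing sketch matching the example: hash to a bucket with
   key mod 5, scan that bucket's slots sequentially, and spill into a
   shared overflow array when the bucket is full. */
#define BH_NB    5     /* number of buckets */
#define BH_BS    2     /* slots per bucket */
#define BH_EMPTY (-1)

void bh_insert(int table[BH_NB * BH_BS], int overflow[], int *ov_count, int key)
{
    int b = key % BH_NB, s;
    for (s = 0; s < BH_BS; s++) {
        if (table[b * BH_BS + s] == BH_EMPTY) {
            table[b * BH_BS + s] = key;   /* first free slot in bucket b */
            return;
        }
    }
    overflow[(*ov_count)++] = key;        /* bucket full: shared overflow */
}
```

Running the sequence 18, 30, 38, 48, 20, 25 places 18 and 38 in B3 (slots 6 and 7), 30 and 20 in B0 (slots 0 and 1), and sends 48 and 25 to the overflow bucket.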
49. When looking for a record, we first take its hash value and search the resulting bucket.
If we search for key value 20, we search in B0, first checking B0[0] which holds a
different value, so we check B0[1] and we find our key.
When searching for the key value 25, we look in B0 sequentially. We see it doesn't hold
our key value and it is full, hence we look through the overflow bucket. First checking
OB[0], then OB[1] and we have found it.
Note that if there are many records in the overflow bucket, this will be an expensive
process.