-V.Dinesh
III/IV CSE-B
ANITS
Hashing
Outline
 11.1 Introduction
– 11.1.1 What is Hashing
– 11.1.2 Collisions
 11.2 A Simple Hashing Algorithm
 11.3 Hashing Functions and Record Distributions
– 11.3.1 Distributing Records among Addresses
– 11.3.2 Some Other Hashing Methods
– 11.3.3 Predicting the Distribution of Records
– 11.3.4 Predicting Collisions for a Full File
11.1.1 What is Hashing
 Hashing is the transformation of a string of characters into a
usually shorter fixed-length value that represents the original
string.
 Hashing is used to index and retrieve items in a database.
 With a good hash function, searching for an element takes O(1) time on average.
How is hashing done?
 It uses a hash function, which
– takes a key (K) as its argument, and
– returns an address (the home address).
 Hash Table
– It is a data structure similar to an array.
– The key is placed at its home address in the hash table.
Hash Function
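/* Toy hash: multiply the ASCII codes of the first two characters
   and keep the last three decimal digits (addresses 0..999).
   Assumes the key has at least two characters. */
int MyHash(const char *key)
{
    int d = (key[0] * key[1]) % 1000;
    return d;
}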
Searching , Insertion and Deletion
Let us consider a hash table H[1000] in which an empty slot holds NULL.
 Inserting:
– H[MyHash(key)] = key;
 Deleting:
– H[MyHash(key)] = NULL;
 Searching:
– Look in H[MyHash(key)]; the key is present if that slot holds it.
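The three operations can be sketched as small C functions. This is only an illustration: the names insert_key, search_key and delete_key, the use of NULL for an empty slot, and the decision to reject a key whose home slot is already taken are assumptions, not part of the slides.

```c
#include <stddef.h>
#include <string.h>

#define TABLE_SIZE 1000

/* Hash function from the slides: product of the first two
   character codes, reduced to the range 0..999. */
int MyHash(const char *key)
{
    return (key[0] * key[1]) % TABLE_SIZE;
}

/* The table stores pointers to keys; NULL marks an empty slot
   (an assumption of this sketch). */
static const char *H[TABLE_SIZE];

/* Store key at its home address. Returns 0 on success, -1 if the
   slot is already occupied (collisions are not resolved here). */
int insert_key(const char *key)
{
    int a = MyHash(key);
    if (H[a] != NULL)
        return -1;
    H[a] = key;
    return 0;
}

/* Returns 1 if key is stored at its home address, otherwise 0. */
int search_key(const char *key)
{
    int a = MyHash(key);
    return H[a] != NULL && strcmp(H[a], key) == 0;
}

/* Empties the home slot if it holds key. Returns 1 if a key was removed. */
int delete_key(const char *key)
{
    if (!search_key(key))
        return 0;
    H[MyHash(key)] = NULL;
    return 1;
}
```

For example, insert_key("BALL") stores the key in slot 290, after which search_key("BALL") returns 1 until delete_key("BALL") is called.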
Insertion
Key = “DINESH”
MyHash(Key) = (68 * 73) % 1000 = 4964 % 1000 = 964
[Figure: a strip of adjacent hash table slots, with “DINESH” stored in the slot at its home address]
Name      Home Address
DINESH    964
BALL      290
TREE      888
11.1.2 Collisions
 Now consider the key “IDIOT”. Its home address is (73 * 68) % 1000 = 964, the same as the home address of “DINESH”.
 When two different keys hash to the same address, we have a collision.
 Keys that hash to the same address are called synonyms.
– E.g. here the keys “DINESH” and “IDIOT” are synonyms.
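The collision can be checked directly by evaluating MyHash on both keys. In the sketch below, are_synonyms is a hypothetical helper name introduced only for illustration.

```c
#include <string.h>

/* Hash function from the slides. */
int MyHash(const char *key)
{
    return (key[0] * key[1]) % 1000;
}

/* Two distinct keys are synonyms when they share a home address. */
int are_synonyms(const char *k1, const char *k2)
{
    return strcmp(k1, k2) != 0 && MyHash(k1) == MyHash(k2);
}
```

Since 'D' is 68 and 'I' is 73, both MyHash("DINESH") and MyHash("IDIOT") evaluate to 4964 % 1000 = 964. The product does not depend on the order of the first two letters, so any two keys that begin with the same pair of letters in either order collide under this function.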
Collisions (contd.)
 Collisions cause problems because we cannot store more than one key at one address.
 Ideally, we would use a hashing algorithm that never produces a collision.
 Such an algorithm is called a perfect hashing algorithm. In practice it is very hard to find one.
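Avoidance of collisions
 Collisions can rarely be eliminated, but they can be reduced in three ways.
– Spread out the records: choose a hash function that distributes keys evenly.
– Use extra memory: provide more addresses than there are records.
– Put more than one record at a single address: make each address a bucket that holds several records.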
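The third option can be sketched as a table of small buckets. Everything below is an assumption made for illustration: the bucket capacity of 3, the struct layout, and the names bucket_insert and bucket_search do not come from the slides.

```c
#include <string.h>

#define NUM_ADDRESSES 1000
#define BUCKET_SIZE   3      /* records per address; chosen arbitrarily */

/* Hash function from the slides. */
int MyHash(const char *key)
{
    return (key[0] * key[1]) % NUM_ADDRESSES;
}

/* Each address is a bucket holding up to BUCKET_SIZE keys. */
struct bucket {
    const char *keys[BUCKET_SIZE];
    int count;
};

static struct bucket table[NUM_ADDRESSES];

/* Add key to the bucket at its home address.
   Returns 0 on success, -1 if the bucket is already full. */
int bucket_insert(const char *key)
{
    struct bucket *b = &table[MyHash(key)];
    if (b->count == BUCKET_SIZE)
        return -1;
    b->keys[b->count++] = key;
    return 0;
}

/* Returns 1 if key is found in its home bucket, otherwise 0. */
int bucket_search(const char *key)
{
    const struct bucket *b = &table[MyHash(key)];
    for (int i = 0; i < b->count; i++)
        if (strcmp(b->keys[i], key) == 0)
            return 1;
    return 0;
}
```

With buckets, the synonyms “DINESH” and “IDIOT” can both be stored at address 964; an overflow only occurs once a bucket has received more keys than it can hold.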
11.2 A Simple Hashing Algorithm
 It consists of three steps.
– Step 1 : Represent the key in numerical form.
– Step 2 : Fold and add
– Step 3 : Divide by the size of the address space and keep the remainder.
1. Represent the Key in Numerical Form
 If the key is already a number, we can skip this step.
 If it is a string, take the ASCII code of each character.
– Let us consider the key = “LOWELL”, padded with blank spaces (ASCII 32) to 12 characters.
 L  O  W  E  L  L  (six blank spaces)
76 79 87 69 76 76 32 32 32 32 32 32
2. Fold and Add
 It means chopping the number into pieces and adding the pieces together.
L O W E L L
7679 | 8769 | 7676 | 3232 | 3232 | 3232
Each piece is formed from the ASCII codes of two adjacent characters. The pieces are added together in the next substep.
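Fold and add (contd.)
 While adding, we have to make sure the sum does not go beyond the range of the data type.
 Let the limit be 32767 (the largest int on a 16-bit compiler). The sum must never exceed it.
 So after each addition, replace the sum by its remainder modulo a prime number such as 19937.
– Why 19937? For keys made of uppercase letters and blanks, the largest piece is 9090 (“ZZ”). The running sum is at most 19936 after each reduction, so the next addition gives at most 19936 + 9090 = 29026, which is below 32767.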
Adding…
L O W E L L
7679 | 8769 | 7676 | 3232 | 3232 | 3232
7679 + 8769 = 16448 → 16448 % 19937 = 16448
16448 + 7676 = 24124 → 24124 % 19937 = 4187
4187 + 3232 = 7419 → 7419 % 19937 = 7419
7419 + 3232 = 10651 → 10651 % 19937 = 10651
10651 + 3232 = 13883 → 13883 % 19937 = 13883
Finally, sum = 13883.
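The running sums can be reproduced with a short C function. This is a sketch under stated assumptions: the key is already padded with blanks to exactly 12 characters, and the function name fold_and_add and the sums output array are introduced here only to expose the intermediate values.

```c
/* Fold and add a 12-character, blank-padded key two characters at a
   time, reducing modulo 19937 after every addition.
   sums[i] receives the running sum after piece i (i = 0..5).
   Returns the final sum. */
int fold_and_add(const char key[12], int sums[6])
{
    int sum = 0;
    for (int j = 0; j < 12; j += 2) {
        int piece = 100 * key[j] + key[j + 1];  /* e.g. 'L','O' -> 7679 */
        sum = (sum + piece) % 19937;
        sums[j / 2] = sum;
    }
    return sum;
}
```

For “LOWELL” followed by six blanks, the running sums are 7679, 16448, 4187, 7419, 10651 and 13883, so the function returns 13883.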
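3. Divide by the Size of the Address Space
 a = s mod n
– where a = home address
– s = sum from step 2
– n = number of addresses in the file.
 A prime divisor tends to spread the remainders more evenly than a non-prime one, so if the number of addresses can be chosen freely, choose a prime close to the required size for n.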
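Hash Function
/* Fold-and-add hash. key must hold exactly 12 characters
   (shorter keys are padded with blanks).
   Returns a home address in the range 0..maxAddress-1. */
int Hash(const char key[12], int maxAddress)
{
    int sum = 0;
    for (int j = 0; j < 12; j += 2)
        sum = (sum + 100 * key[j] + key[j + 1]) % 19937;
    return sum % maxAddress;
}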
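Hash expects a key of exactly 12 characters, so a shorter name has to be padded with blanks first. The sketch below adds that step; pad_key, hash_name and KEY_LEN are hypothetical names, and truncating names longer than 12 characters is an assumption of this example.

```c
#include <string.h>

#define KEY_LEN 12

/* Fold-and-add hash from the slides. */
int Hash(const char key[KEY_LEN], int maxAddress)
{
    int sum = 0;
    for (int j = 0; j < KEY_LEN; j += 2)
        sum = (sum + 100 * key[j] + key[j + 1]) % 19937;
    return sum % maxAddress;
}

/* Copy name into a KEY_LEN-character buffer, truncating long names
   and padding short ones with blanks. The buffer is not NUL-terminated. */
void pad_key(const char *name, char out[KEY_LEN])
{
    size_t n = strlen(name);
    if (n > KEY_LEN)
        n = KEY_LEN;
    memcpy(out, name, n);
    memset(out + n, ' ', KEY_LEN - n);
}

/* Pad the name, then hash it into the range 0..maxAddress-1. */
int hash_name(const char *name, int maxAddress)
{
    char key[KEY_LEN];
    pad_key(name, key);
    return Hash(key, maxAddress);
}
```

With 101 addresses (a prime), hash_name("LOWELL", 101) folds the key to 13883 and returns 13883 % 101 = 46.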
11.3.1 Distributing Records among Addresses
[Figure: three mappings of records A to G onto addresses 1 to 10, in panels titled “Uniform distribution”, “All synonyms”, and “A few synonyms”]
11.3.2 Some other Hashing Methods
 Examine Keys for a pattern
 Fold parts of the key
 Divide the key by a number
 Square the key and take the middle
 Radix transformation
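As an illustration of the “square the key and take the middle” method, here is a minimal sketch. It assumes numeric keys in the range 0..999, treats the square as a six-digit number, and extracts the middle two digits as an address in 0..99; the function name mid_square and these ranges are assumptions, not taken from the slides.

```c
/* Mid-square hashing: square the key and keep the middle two digits
   of the (at most six-digit) square. Assumes 0 <= key <= 999. */
int mid_square(int key)
{
    long sq = (long)key * key;       /* at most 998001 */
    return (int)((sq / 100) % 100);  /* drop the last two digits, keep the next two */
}
```

For key 453 the square is 205209, and the middle two digits give the address 52.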
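11.3.3 Predicting the Distribution of Records
 We cannot say exactly how a hash function will distribute a given set of records, but if we assume it assigns addresses at random we can predict the distribution.
 Suppose r records are hashed to N addresses. For one particular address, let B be the event that a record hashes to that address (probability b) and A the event that it does not (probability a).
– p(one particular sequence of x B’s and (r-x) A’s) = a^(r-x) * b^x
– The number of ways that x B’s and (r-x) A’s can be arranged is C = r! / ((r-x)! * x!)
– So the probability that exactly x of the r records hash to the address is p(x) = C * a^(r-x) * b^x
 Setting a = 1 - 1/N and b = 1/N, for large N and r this is closely approximated by the Poisson distribution
p(x) = ((r/N)^x * e^(-r/N)) / x!
 In general, if there are N addresses, the expected number of addresses with x records assigned to them is
N * p(x)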
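The formula can be evaluated with a few lines of C. This is a sketch only: the function names are hypothetical, and e^(-r/N) is computed with a hand-rolled series purely to keep the example free of library dependencies (exp() from <math.h> would normally be used). The series is adequate for modest values of r/N.

```c
/* e^(-m) for m >= 0, as 1 / (sum of m^k / k!). Hand-rolled only so the
   sketch needs no math library; accurate for modest m. */
double exp_neg(double m)
{
    double term = 1.0, sum = 1.0;
    for (int k = 1; k <= 100; k++) {
        term *= m / k;
        sum += term;
    }
    return 1.0 / sum;
}

/* Poisson probability that exactly x of r records hash to one
   particular address out of N: ((r/N)^x * e^(-r/N)) / x! */
double poisson(int x, double r, double N)
{
    double m = r / N;
    double p = exp_neg(m);          /* p(0) */
    for (int k = 1; k <= x; k++)
        p *= m / k;                 /* p(k) = p(k-1) * m / k */
    return p;
}

/* Expected number of addresses with exactly x records: N * p(x). */
double expected_addresses(int x, double r, double N)
{
    return N * poisson(x, r, N);
}
```

For instance, expected_addresses(0, 10000, 10000) is about 3679, since p(0) = e^(-1) ≈ 0.3679.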
11.3.4 Predicting Collisions for a Full File
 Suppose a file holds 10000 records in 10000 addresses.
 Here r = 10000 and N = 10000, so r/N = 1.
Substituting in p(x) = ((r/N)^x * e^(-r/N)) / x!:
p(0) = (1^0 * e^(-1)) / 0! = 0.3679
The expected number of addresses with no records assigned is
N * p(0) = 10000 * 0.3679 = 3679
 Similarly, the expected numbers of addresses with one, two, and three records assigned are, respectively,
 10000 * p(1) = 3679
 10000 * p(2) = 1839
 10000 * p(3) = 613
 Since each address holds only one record, the 1839 addresses with two records produce 1839 overflow records, and the 613 addresses with three records produce 2 * 613 = 1226 overflow records.
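The overflow counts can be accumulated in code. The sketch below assumes one record per address, so an address that receives x records overflows x - 1 of them. Summing beyond x = 3 goes one step further than the two cases listed above, and the function names and the max_x cut-off are assumptions of this example. As before, e^(-m) is hand-rolled only to avoid a library dependency.

```c
/* e^(-m) for m >= 0 via a short series; exp() from <math.h> would
   normally be used instead. */
double exp_neg(double m)
{
    double term = 1.0, sum = 1.0;
    for (int k = 1; k <= 100; k++) {
        term *= m / k;
        sum += term;
    }
    return 1.0 / sum;
}

/* Poisson p(x) with mean m = r/N. */
double poisson(int x, double m)
{
    double p = exp_neg(m);
    for (int k = 1; k <= x; k++)
        p *= m / k;
    return p;
}

/* Expected number of overflow records, counting addresses that
   receive 2..max_x records; each such address overflows x - 1 records. */
double expected_overflow(double r, double N, int max_x)
{
    double total = 0.0;
    for (int x = 2; x <= max_x; x++)
        total += (x - 1) * N * poisson(x, r / N);
    return total;
}
```

With r = N = 10000, stopping at max_x = 3 gives roughly 1839 + 1226 = 3065 overflow records; letting max_x grow (20 is ample here) brings the total to about 3679.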
