HASHING & COLLISIONS
• INTRODUCTION: DEFINITION, PROPERTIES,
REPRESENTATION, FUNCTIONS, & APPLICATIONS.
• HASH FUNCTIONS: DIVISION, MULTIPLICATION, MID-SQUARE,
& FOLDING METHODS.
• COLLISION RESOLUTION: LINEAR PROBING,
QUADRATIC PROBING, & SEPARATE CHAINING.
INTRODUCTION TO HASHING
 What is Hashing? Hashing is a method of storing and retrieving data in O(1) (constant) time on average.
 The Problem: In an unsorted Array or a Linked List, to find a specific number (e.g., "500"), you might have to check every
single slot. This is O(n), which is slow for large collections.
 The Solution: Hashing uses a formula to calculate the exact address. If you want "500", the formula tells you:
"Go immediately to Index 4".
 Representation:
 Hash Table: An array of fixed size M where data is stored.
 Hash Function (h(x)): A mathematical algorithm that takes an input (Key) and produces an output (Index).
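A minimal sketch of the contrast between scanning and hashing. The table size of 31 and the formula `key % TABLE_SIZE` are assumptions, not taken from the slide; 31 is chosen only because it sends 500 to index 4, matching the example above.

```python
# Hypothetical setup: TABLE_SIZE = 31 is an assumption chosen so that 500 maps to index 4.
TABLE_SIZE = 31


def h(key):
    """Hash function: turn a key into a table index."""
    return key % TABLE_SIZE


def linear_search_checks(items, target):
    """O(n) baseline: return how many slots were checked before finding target."""
    checks = 0
    for value in items:
        checks += 1
        if value == target:
            break
    return checks


# Hash table: a fixed-size array; the key is stored at the computed index.
table = [None] * TABLE_SIZE
table[h(500)] = 500

print(h(500))                                      # 4
print(table[h(500)] == 500)                        # True: one computation, no scan
print(linear_search_checks([7, 19, 3, 500], 500))  # 4 slots checked
```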
PROPERTIES & APPLICATIONS
 Properties of a Good Hash Function:
 Low Cost: It must be very fast to calculate.
 Uniform Distribution: It should spread keys evenly across the table. It should not clump data into one area (Clustering).
 Deterministic: The same key must always produce the same index. Randomness is not allowed.
 Real-World Applications:
 Databases: Indexing for fast lookups (e.g., finding a user by ID).
 Cryptography: Storing passwords as hashes (e.g., SHA-256; MD5 is an older example that is no longer considered secure). We never store plain-text passwords; we store the hash.
 Compilers: "Symbol Tables" used to store variable names and function names during compilation.
 Caches: Browser caches use hashing to quickly locate saved web pages.
HASH FUNCTION 1: DIVISION METHOD
 Map a key into a table slot by taking the remainder of the key divided by the table size.
 Formula:
 h(k)=k mod m
 (where k is the key and m is the table size)
 Detailed Example:
 Table Size (m): 13 (Using a Prime number is best to avoid patterns).
 Key (k): 25
 Calculation: 25÷13=1 with a remainder of 12.
 Result: Store 25 at Index 12.
 Pros & Cons:
 Pro: Extremely fast (just one division operation).
 Con: Poor performance if m is not a prime number (e.g., if m is 2^p, the hash depends only on the last p bits of the key).
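A short sketch of the division method, reproducing the worked example and illustrating the power-of-two weakness. The function name `division_hash` and the three sample keys are illustrative choices, not from the slides.

```python
def division_hash(k, m):
    """Division method: h(k) = k mod m."""
    return k % m


print(division_hash(25, 13))  # 12

# With m = 2**3 = 8, only the last 3 bits of the key matter.
# These three keys differ in their high bits but all end in binary 101.
keys = [0b0001101, 0b1111101, 0b1010101]
print([division_hash(k, 8) for k in keys])  # [5, 5, 5]
```

All three keys collide at index 5 because the high bits are discarded, which is why a prime table size is usually preferred.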
HASH FUNCTION 2: MULTIPLICATION METHOD
 Multiply the key by a constant fraction (0<A<1), take the decimal part, and scale it to the table size.
 Steps:
 Choose constant A (Knuth suggests (√5 − 1)/2 ≈ 0.618).
 Multiply k×A.
 Take the fractional part (remove the whole number).
 Multiply by table size m.
 Take the floor (integer part).
 Detailed Example:
 Size (m): 100
 Key (k): 123
 Constant (A): 0.618
 123×0.618=76.014
 Fraction is 0.014
 0.014×100=1.4
 Floor(1.4) = 1
 Result: Store at Index 1.
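The five steps can be sketched as follows. The function name and the default argument are illustrative; the rounded constant 0.618 is passed explicitly in the second call to mirror the worked example.

```python
import math

# Knuth's suggested constant, (sqrt(5) - 1) / 2, roughly 0.618.
A = (math.sqrt(5) - 1) / 2


def multiplication_hash(k, m, a=A):
    """Multiplication method: floor(m * fractional_part(k * a))."""
    frac = (k * a) % 1.0         # keep only the fractional part
    return math.floor(m * frac)  # scale to the table size and take the floor


print(multiplication_hash(123, 100))         # 1 (full-precision A)
print(multiplication_hash(123, 100, 0.618))  # 1 (rounded A, as in the example)
```

Unlike the division method, the table size m does not need to be prime here, since the mixing comes from the multiplication by A.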
HASH FUNCTION 3: MID-SQUARE METHOD
 Good for randomization. Square the key to get a larger number, then extract the middle digits.
 Steps:
 Compute k².
 Extract the middle r digits (where r depends on table size. If table is size 100, we need 2 digits).
 Detailed Example:
 Table Size: 100 (Indices 00-99)
 Key: 45
 Square: 45×45=2025
 Extract Middle: The middle digits of 2025 are 02.
 Result: Store at Index 2.
 Note: Why the middle? The middle digits depend on all digits of the original key, mixing the data thoroughly.
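A possible sketch of the mid-square method. The slides do not say what to do when the square has too few digits or when the middle is off-centre; left-padding with zeros is an assumption made here so that a middle always exists.

```python
def mid_square_hash(k, r):
    """Square k and return the middle r digits of the square as an index."""
    squared = str(k * k).zfill(r)  # assumption: pad short squares up to r digits
    # Assumption: add one leading zero so the leftover digits split evenly.
    if (len(squared) - r) % 2 != 0:
        squared = "0" + squared
    start = (len(squared) - r) // 2
    return int(squared[start:start + r])


print(mid_square_hash(45, 2))   # 45 * 45 = 2025 -> middle "02" -> 2
print(mid_square_hash(123, 2))  # 15129 -> padded "015129" -> middle "51" -> 51
```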
HASH FUNCTION 4: FOLDING METHOD
 Used for large keys (like Social Security Numbers or IP addresses). Break the key into chunks and add them up.
 Two Types:
 Fold Shift: Add parts as they are.
 Fold Boundary: Reverse the boundary parts before adding (more complex mixing).
 Detailed Example (Fold Shift):
 Key: 123456789
 Table Size: 1000 (We need 3-digit indices).
 Split: 123 | 456 | 789
 Add: 123+456+789=1368
 Wrap: Ignore the leading '1' to fit in size 1000 (equivalently, 1368 mod 1000 = 368).
 Result: Store at Index 368.
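Both folding variants can be sketched as below. The fold-boundary version assumes that "boundary parts" means the first and last chunks, which is one common textbook reading; the slides do not spell this out.

```python
def split_chunks(key, chunk_digits):
    """Split the decimal digits of key into chunks of chunk_digits each."""
    digits = str(key)
    return [digits[i:i + chunk_digits] for i in range(0, len(digits), chunk_digits)]


def fold_shift_hash(key, chunk_digits, m):
    """Fold shift: add the chunks as they are, then wrap into the table."""
    return sum(int(c) for c in split_chunks(key, chunk_digits)) % m


def fold_boundary_hash(key, chunk_digits, m):
    """Fold boundary (assumed reading): reverse the first and last chunks, then add."""
    chunks = split_chunks(key, chunk_digits)
    if len(chunks) > 1:
        chunks[0] = chunks[0][::-1]
        chunks[-1] = chunks[-1][::-1]
    return sum(int(c) for c in chunks) % m


print(fold_shift_hash(123456789, 3, 1000))     # 123 + 456 + 789 = 1368 -> 368
print(fold_boundary_hash(123456789, 3, 1000))  # 321 + 456 + 987 = 1764 -> 764
```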
COLLISION RESOLUTION TECHNIQUES
 What is a Collision? When two different keys generate the same index.
 Hash(15)→5
 Hash(25)→5
 We cannot store two items in one array slot.
 We need a strategy to fix this:
 Open Hashing (Separate Chaining): Store collisions outside the table (in a list).
 Closed Hashing (Open Addressing): Find another empty slot inside the table.
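To make collisions concrete, the sketch below groups keys by index and reports every index that receives more than one key. It assumes the hash function k mod 10, which is consistent with Hash(15)→5 and Hash(25)→5 but is not stated on this slide; the extra keys 7 and 42 are illustrative.

```python
from collections import defaultdict


def find_collisions(keys, m):
    """Return {index: [keys]} for every index hit by more than one key."""
    buckets = defaultdict(list)
    for k in keys:
        buckets[k % m].append(k)  # assumed hash function: k mod m
    return {index: ks for index, ks in buckets.items() if len(ks) > 1}


print(find_collisions([15, 25, 7, 42], 10))  # {5: [15, 25]}
```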
TECHNIQUE 1: SEPARATE CHAINING
 Each slot in the hash table points to a Linked List. If a collision happens, add the new item to the end of the list at
that index.
 Example:
 Function: k mod 10
 Insert 12: Index 2. [12]
 Insert 22: Index 2. Collision! [12] -> [22]
 Insert 32: Index 2. Collision! [12] -> [22] -> [32]
 Analysis:
 Pro: The table never gets "full".
 Con: Uses extra memory for pointers. Search speed degrades to O(n) if the chain gets too long.
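A minimal sketch of separate chaining that replays the example. Python lists stand in for the linked lists here, and the class and method names are illustrative.

```python
class ChainedHashTable:
    """Hash table with separate chaining; each slot holds a chain of keys."""

    def __init__(self, m=10):
        self.m = m
        # One independent chain per slot (a Python list stands in for a linked list).
        self.slots = [[] for _ in range(m)]

    def insert(self, key):
        chain = self.slots[key % self.m]
        if key not in chain:   # skip duplicates
            chain.append(key)  # a collision simply extends the chain

    def contains(self, key):
        # Only the one chain at the key's index is scanned.
        return key in self.slots[key % self.m]


table = ChainedHashTable(10)
for key in (12, 22, 32):
    table.insert(key)

print(table.slots[2])      # [12, 22, 32]
print(table.contains(22))  # True
print(table.contains(42))  # False
```

Searching costs time proportional to the length of one chain, which is why performance degrades when many keys land on the same index.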
TECHNIQUE 2: LINEAR PROBING (OPEN ADDRESSING)
 If the calculated slot is full, check the following slots one by one until an empty slot is found.
 Formula: Index = (Hash(k) + i) mod m (i = 1, 2, 3, ...)
 Example:
 Function: k mod 10
 Insert 55: Index 5. (Placed)
 Insert 65: Index 5 (Full).
 Check Index 6. (Empty? Yes. Place 65 here).
 Insert 75: Index 5 (Full).
 Check Index 6 (Full).
 Check Index 7 (Empty? Yes. Place 75 here).
 The Problem: Primary Clustering. Long blocks of occupied cells form, increasing search time for future items.
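The example can be replayed with the sketch below. It starts the loop at i = 0 so that the home slot is tried first, and it raises an error when the table is full; both are implementation choices not specified on the slide.

```python
def linear_probe_insert(table, key):
    """Insert key using linear probing; return the index where it was placed."""
    m = len(table)
    home = key % m
    for i in range(m):          # i = 0 is the home slot, then +1, +2, ...
        index = (home + i) % m  # wrap around at the end of the table
        if table[index] is None:
            table[index] = key
            return index
    raise RuntimeError("hash table is full")


table = [None] * 10
print([linear_probe_insert(table, k) for k in (55, 65, 75)])  # [5, 6, 7]
```

The three keys end up in adjacent slots 5, 6, and 7, which is a small instance of the primary clustering described above.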
TECHNIQUE 3: QUADRATIC PROBING (OPEN ADDRESSING)
 To fix the clustering problem of Linear Probing, we don't jump by 1. We jump by squares (1², 2², 3², 4²).
 Formula: Index = (Hash(k) + i²) mod m
 Example:
 Insert at Index 5 (Full).
 Attempt 1: 5 + 1² = 6. (Full?)
 Attempt 2: 5 + 2² = 9. (Full?)
 Attempt 3: 5 + 3² = 14 mod 10 = 4. (Empty? Place here).
 Analysis:
 Pro: Reduces Primary Clustering.
 Con: Can still suffer from "Secondary Clustering" (keys hashing to the same start point follow the same path).
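A sketch of quadratic probing that follows the attempts in the example. The pre-filled values 55, 66, and 99 and the key 75 are hypothetical, chosen so that slots 5, 6, and 9 are occupied and the key's home slot is 5.

```python
def quadratic_probe_insert(table, key):
    """Insert key using quadratic probing; return the index where it was placed."""
    m = len(table)
    home = key % m
    for i in range(m):              # offsets 0, 1, 4, 9, ...
        index = (home + i * i) % m
        if table[index] is None:
            table[index] = key
            return index
    # The probe sequence may skip some slots, so this can fail before the table is full.
    raise RuntimeError("no empty slot found on the probe sequence")


table = [None] * 10
table[5], table[6], table[9] = 55, 66, 99  # hypothetical occupied slots

print(quadratic_probe_insert(table, 75))  # tries 5, 6, 9, then 14 mod 10 = 4 -> 4
```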
SUMMARY
 Goal: Search and Insert in O(1) time on average.
 Hash Functions: Division (k%m) is simplest. Multiplication & Mid-Square provide better randomness. Folding is
for large keys.
 Separate Chaining: Uses Linked Lists. Good when the number of items is not known in advance, since the table never fills up.
 Linear Probing: Jumps +1. Simple but causes clustering.
 Quadratic Probing: Jumps +1, +4, +9. Reduces primary clustering, but secondary clustering remains.
