HASHing and Collisions of data structures.pptx

HASHING & COLLISIONS
• INTRODUCTION: DEFINITION, PROPERTIES, REPRESENTATION, FUNCTIONS, & APPLICATIONS.
• HASH FUNCTIONS: DIVISION, MULTIPLICATION, MID-SQUARE, & FOLDING METHODS.
• COLLISION RESOLUTION: LINEAR PROBING, QUADRATIC PROBING, & SEPARATE CHAINING.

INTRODUCTION TO HASHING
• What is Hashing? Hashing is a method of storing and retrieving data in O(1) (constant) time on average.
• The Problem: In an unsorted Array or Linked List, to find a specific number (e.g., "500"), you might have to check every single slot. This is O(n), which is slow for large collections.
• The Solution: Hashing uses a formula to calculate the address directly from the key. If you want "500", the formula tells you: "Go immediately to Index 4".
• Representation:
  • Hash Table: An array of fixed size m where data is stored.
  • Hash Function (h(k)): A function that takes an input (Key) and produces an output (Index) between 0 and m − 1.

PROPERTIES & APPLICATIONS
• Properties of a Good Hash Function:
  • Low Cost: It must be very fast to calculate.
  • Uniform Distribution: It should spread keys evenly across the table rather than clumping them into one area (Clustering).
  • Deterministic: The same key must always produce the same index. Randomness is not allowed.
• Real-World Applications:
  • Databases: Indexing for fast lookups (e.g., finding a user by ID).
  • Cryptography: Storing passwords with cryptographic hash functions (e.g., SHA-256; the older MD5 is no longer considered secure). We never store plain-text passwords; we store the hash.
  • Compilers: "Symbol Tables" store variable names and function names during compilation.
  • Caches: Browser caches use hashing to quickly locate saved web pages.

HASH FUNCTION 1: DIVISION METHOD
• Map a key into a table slot by taking the remainder of the key divided by the table size.
• Formula:
  • h(k) = k mod m
  • (where k is the key and m is the table size)
• Detailed Example:
  • Table Size (m): 13 (a prime number is best to avoid patterns).
  • Key (k): 25
  • Calculation: 25 ÷ 13 = 1 with a remainder of 12.
  • Result: Store 25 at Index 12.
• Pros & Cons:
  • Pro: Extremely fast (just one division operation).
  • Con: Poor distribution if m is badly chosen (e.g., if m is 2^p, the hash depends only on the lowest p bits of the key).
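
The division method can be sketched in a few lines of Python. This is a minimal illustration, not a production hash function; the name division_hash is an assumption made for this example, and keys are assumed to be non-negative integers.

```python
def division_hash(key: int, table_size: int) -> int:
    """Division method: h(k) = k mod m."""
    if table_size <= 0:
        raise ValueError("table_size must be a positive integer")
    return key % table_size


# The slide's example: key 25 in a table of size 13.
print(division_hash(25, 13))  # 12

# With m = 2^4 = 16, only the lowest 4 bits of the key matter:
# these two keys differ in their high bits but land in the same slot.
print(division_hash(0b10110100, 16))  # 4
print(division_hash(0b11110100, 16))  # 4
```

The last two calls show why a power-of-two table size is risky: any pattern in the low bits of the keys carries straight through to the indices.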

HASH FUNCTION 2: MULTIPLICATION METHOD
• Multiply the key by a constant fraction A (0 < A < 1), take the fractional part, and scale it to the table size.
• Steps:
  • Choose constant A (Knuth suggests (√5 − 1)/2 ≈ 0.618).
  • Multiply k × A.
  • Take the fractional part (remove the whole number).
  • Multiply by table size m.
  • Take the floor (integer part).
• Detailed Example:
  • Size (m): 100
  • Key (k): 123
  • Constant (A): 0.618
  • 123 × 0.618 = 76.014
  • Fraction is 0.014
  • 0.014 × 100 = 1.4
  • Floor(1.4) = 1
  • Result: Store 123 at Index 1.
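
The five steps above can be sketched in Python as follows. The function name multiplication_hash and the default constant are assumptions for this example; floating-point arithmetic is used for readability, which is fine for small keys but loses precision for very large ones.

```python
import math

# Knuth's suggested constant, (sqrt(5) - 1) / 2, roughly 0.618.
A = (math.sqrt(5) - 1) / 2


def multiplication_hash(key: int, table_size: int, a: float = A) -> int:
    """Multiplication method: floor(m * fractional_part(k * A))."""
    fractional = (key * a) % 1.0          # keep only the part after the decimal point
    return math.floor(table_size * fractional)


# The slide's example, with A rounded to 0.618.
print(multiplication_hash(123, 100, a=0.618))  # 1

# The unrounded constant gives the same slot for this key.
print(multiplication_hash(123, 100))  # 1
```

Unlike the division method, the table size here does not need to be prime, because the mixing comes from the multiplication by A.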

HASH FUNCTION 3: MID-SQUARE METHOD
• Good for randomization. Square the key to get a larger number, then extract the middle digits.
• Steps:
  • Compute k².
  • Extract the middle r digits (r depends on the table size; a table of size 100 needs 2 digits).
• Detailed Example:
  • Table Size: 100 (Indices 00-99)
  • Key: 45
  • Square: 45 × 45 = 2025
  • Extract Middle: The middle digits of 2025 are 02.
  • Result: Store 45 at Index 2.
• Note: Why the middle? The middle digits of the square are influenced by all digits of the original key, mixing the data thoroughly.
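
A minimal Python sketch of the mid-square steps is shown below. The name mid_square_hash is an assumption for this example, and so is the tie-breaking rule: when the square has an odd number of digits there is no exact middle, so this version leans toward the left.

```python
def mid_square_hash(key: int, digits: int) -> int:
    """Mid-square method: square the key and keep the middle `digits` digits."""
    squared = str(key * key).zfill(digits)   # pad so tiny keys still yield enough digits
    start = (len(squared) - digits) // 2     # left-leaning middle when the length is odd
    return int(squared[start:start + digits])


# The slide's example: 45 squared is 2025, whose middle two digits are "02".
print(mid_square_hash(45, 2))   # 2

# 123 squared is 15129; with the left-leaning rule the middle two digits are "51".
print(mid_square_hash(123, 2))  # 51
```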

HASH FUNCTION 4: FOLDING METHOD
• Used for large keys (like Social Security Numbers or IP addresses). Break the key into equal-sized chunks and add them up.
• Two Types:
  • Fold Shift: Add parts as they are.
  • Fold Boundary: Reverse the boundary parts before adding (more complex mixing).
• Detailed Example (Fold Shift):
  • Key: 123456789
  • Table Size: 1000 (We need 3-digit indices).
  • Split: 123 | 456 | 789
  • Add: 123 + 456 + 789 = 1368
  • Wrap: Ignore the leading '1' (1368 mod 1000) to fit in size 1000.
  • Result: Store the key at Index 368.
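
Both folding variants can be sketched in Python. The function names are assumptions for this example, keys are assumed to be non-negative integers, and "boundary parts" is interpreted here as the first and last chunks, whose digits are reversed before adding.

```python
def split_chunks(key: int, chunk_digits: int) -> list[str]:
    """Split the decimal digits of `key` into chunks, left to right."""
    text = str(key)
    return [text[i:i + chunk_digits] for i in range(0, len(text), chunk_digits)]


def fold_shift_hash(key: int, chunk_digits: int = 3) -> int:
    """Fold shift: add the chunks as they are, then drop any overflow digit."""
    total = sum(int(chunk) for chunk in split_chunks(key, chunk_digits))
    return total % 10 ** chunk_digits


def fold_boundary_hash(key: int, chunk_digits: int = 3) -> int:
    """Fold boundary (assumed variant): reverse the first and last chunks, then add."""
    chunks = split_chunks(key, chunk_digits)
    boundary = {0, len(chunks) - 1}
    total = sum(
        int(chunk[::-1]) if i in boundary else int(chunk)
        for i, chunk in enumerate(chunks)
    )
    return total % 10 ** chunk_digits


# The slide's example: 123 + 456 + 789 = 1368, wrapped to 368.
print(fold_shift_hash(123456789))     # 368

# Boundary variant: 321 + 456 + 987 = 1764, wrapped to 764.
print(fold_boundary_hash(123456789))  # 764
```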

COLLISION RESOLUTION TECHNIQUES
• What is a Collision? When two different keys generate the same index. For example, with h(k) = k mod 10:
  • Hash(15) → 5
  • Hash(25) → 5
• We cannot store two items in one array slot.
• We need a strategy to fix this:
  • Open Hashing (Separate Chaining): Store collisions outside the table (in a list).
  • Closed Hashing (Open Addressing): Find another empty slot inside the table.

TECHNIQUE 1: SEPARATE CHAINING
• Each slot in the hash table points to a Linked List. If a collision happens, add the new item to the end of the list at that index.
• Example:
  • Function: k mod 10
  • Insert 12: Index 2. [12]
  • Insert 22: Index 2. Collision! [12] -> [22]
  • Insert 32: Index 2. Collision! [12] -> [22] -> [32]
• Analysis:
  • Pro: The table never gets "full".
  • Con: Uses extra memory for pointers. Search time degrades toward O(n) if many keys end up in the same chain.
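
A minimal Python sketch of separate chaining follows. The class name ChainedHashTable and its methods are assumptions for this example, and each chain is a plain Python list rather than a hand-built linked list, which keeps the sketch short while preserving the idea of one chain per slot.

```python
class ChainedHashTable:
    """Separate chaining with h(k) = k mod m; each slot holds a chain of keys."""

    def __init__(self, size: int = 10) -> None:
        self.size = size
        self.buckets: list[list[int]] = [[] for _ in range(size)]

    def _index(self, key: int) -> int:
        return key % self.size

    def insert(self, key: int) -> None:
        chain = self.buckets[self._index(key)]
        if key not in chain:          # ignore duplicates
            chain.append(key)         # new keys go at the end of the chain

    def contains(self, key: int) -> bool:
        # Only the one chain at the key's index has to be scanned.
        return key in self.buckets[self._index(key)]


table = ChainedHashTable(size=10)
for key in (12, 22, 32):
    table.insert(key)

print(table.buckets[2])    # [12, 22, 32]
print(table.contains(22))  # True
print(table.contains(42))  # False
```

All three keys end up in the chain at index 2, so a lookup for 32 has to walk past 12 and 22 first. That is the slowdown the analysis above describes.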

TECHNIQUE 2: LINEAR PROBING (OPEN ADDRESSING)
• If the calculated slot is full, check the following slots one at a time until an empty one is found, wrapping around at the end of the table.
• Formula: Index = (Hash(k) + i) mod m (i = 1, 2, 3, ...)
• Example:
  • Function: k mod 10
  • Insert 55: Index 5. (Placed)
  • Insert 65: Index 5 (Full).
    • Check Index 6. (Empty? Yes. Place 65 here).
  • Insert 75: Index 5 (Full).
    • Check Index 6 (Full).
    • Check Index 7 (Empty? Yes. Place 75 here).
• The Problem: Primary Clustering. Long blocks of occupied cells form, increasing search time for future items.
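
Linear probing can be sketched in Python as below. The class name LinearProbingTable is an assumption for this example, and deletion is left out on purpose because it needs extra bookkeeping (such as tombstone markers) that the slides do not cover.

```python
from typing import Optional


class LinearProbingTable:
    """Open addressing with linear probing and h(k) = k mod m."""

    def __init__(self, size: int = 10) -> None:
        self.size = size
        self.slots: list[Optional[int]] = [None] * size

    def insert(self, key: int) -> int:
        """Place `key` in the first free slot at or after its home index."""
        home = key % self.size
        for i in range(self.size):                 # i = 0 is the home slot itself
            index = (home + i) % self.size
            if self.slots[index] is None or self.slots[index] == key:
                self.slots[index] = key
                return index
        raise OverflowError("hash table is full")

    def find(self, key: int) -> Optional[int]:
        """Return the index holding `key`, or None if it is absent."""
        home = key % self.size
        for i in range(self.size):
            index = (home + i) % self.size
            if self.slots[index] is None:          # an empty slot ends the search
                return None
            if self.slots[index] == key:
                return index
        return None


table = LinearProbingTable(size=10)
print(table.insert(55))  # 5
print(table.insert(65))  # 6
print(table.insert(75))  # 7
print(table.find(75))    # 7
```

The three keys now occupy the contiguous block 5, 6, 7. Any later key that hashes to 5, 6, or 7 must walk to the end of that block, which is primary clustering in miniature.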

TECHNIQUE 3: QUADRATIC PROBING (OPEN ADDRESSING)
• To fix the clustering problem of Linear Probing, we don't jump by 1. We jump by squares (1², 2², 3², 4², ...), measured from the original index.
• Formula: Index = (Hash(k) + i²) mod m (i = 1, 2, 3, ...)
• Example (table size 10):
  • Insert at Index 5 (Full).
  • Attempt 1: 5 + 1² = 6. (Full?)
  • Attempt 2: 5 + 2² = 9. (Full?)
  • Attempt 3: 5 + 3² = 14, and 14 mod 10 = 4. (Empty? Place here).
• Analysis:
  • Pro: Reduces Primary Clustering.
  • Con: Can still suffer from "Secondary Clustering" (keys hashing to the same start point follow the same path).
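
The probe sequence above can be sketched in Python. The function name quadratic_probe_insert is an assumption for this example, and the pre-filled slots are only there to reproduce the slide's scenario. Note that, depending on the table size, the quadratic sequence may not reach every slot, so this sketch gives up after m attempts.

```python
from typing import Optional


def quadratic_probe_insert(slots: list[Optional[int]], key: int) -> int:
    """Insert `key` using Index = (Hash(k) + i^2) mod m, with Hash(k) = k mod m."""
    m = len(slots)
    home = key % m
    for i in range(m):                    # i = 0 checks the home slot first
        index = (home + i * i) % m
        if slots[index] is None:
            slots[index] = key
            return index
    raise OverflowError("no free slot found along the probe sequence")


# Reproduce the slide's scenario: indices 5, 6 and 9 are already occupied.
slots: list[Optional[int]] = [None] * 10
slots[5], slots[6], slots[9] = 5, 6, 9

# Key 15 hashes to 5, then tries 6, then 9, and finally lands at (5 + 9) mod 10.
print(quadratic_probe_insert(slots, 15))  # 4
```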

SUMMARY
• Goal: Search and Insert in O(1) time on average.
• Hash Functions: Division (k % m) is simplest. Multiplication & Mid-Square provide better randomness. Folding is for large keys.
• Separate Chaining: Uses Linked Lists. The table never fills up, but long chains slow down searches.
• Linear Probing: Jumps +1. Simple but causes primary clustering.
• Quadratic Probing: Jumps +1, +4, +9. Reduces primary clustering, though secondary clustering remains.
