Course Outline
Introduction and Algorithm Analysis (Ch. 2)
Hash Tables: dictionary data structure (Ch. 5)
Heaps: priority queue data structures (Ch. 6)
Balanced Search Trees: general search structures (Ch. 4.1-4.5)
Union-Find data structure (Ch. 8.1–8.5)
Graphs: Representations and basic algorithms
 Topological Sort (Ch. 9.1-9.2)
 Minimum spanning trees (Ch. 9.5)
 Shortest-path algorithms (Ch. 9.3.2)
B-Trees: External-Memory data structures (Ch. 4.7)
kD-Trees: Multi-Dimensional data structures (Ch. 12.6)
Misc.: Streaming data, randomization
Data Structures for Sets
Many applications deal with sets.
 Compilers have symbol tables (set of vars, classes)
 IP routers have IP addresses, packet forwarding rules
 Web servers have set of clients, etc.
 Dictionary is a set of words.
A set is a collection of members
 No repetition of members
 Members themselves can be sets
Examples
 {x | x is a positive integer and x < 100}
 {x | x is a CA driver with > 10 years of driving experience
and 0 accidents in the last 3 years}
 All webpages containing the word Algorithms
Abstract Data Types
Set + Operations define an ADT.
 A set + insert, delete, find
 A set + ordering
 Multiple sets + union, insert, delete
 Multiple sets + merge
 Etc.
Depending on type of members and choice of
operations, different implementations can have
different asymptotic complexity.
Dictionary ADTs
Data structure with just 3 basic operations:
 find (i): find item with key i
 insert (i): insert i into the dictionary
 remove (i): delete i
 Just like words in a Dictionary
Where do we use them:
 Symbol tables for compiler
 Customer records (access by name)
 Games (positions, configurations)
 Spell checkers
 P2P systems (access songs by name), etc.
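To make the three operations concrete, here is a minimal C++ sketch of the dictionary interface; the names (Dictionary, Key, Item) are illustrative, not from the slides:

#include <optional>

// Minimal dictionary ADT: just the 3 basic operations.
// Concrete implementations (linked list, direct mapping,
// hash table) are compared in the following slides.
template <typename Key, typename Item>
class Dictionary {
public:
    virtual ~Dictionary() = default;
    virtual void insert(const Key& k, const Item& v) = 0;      // insert item with key k
    virtual std::optional<Item> find(const Key& k) const = 0;  // find item with key k
    virtual void remove(const Key& k) = 0;                     // delete item with key k
};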
Naïve Method: Linked List
Keep a linked list of the keys
 insert (i): add to the head of the list. Easy and fast: O(1)
 find (i): worst case, search the whole list (linear)
 remove (i): also linear in the worst case
Another Naïve Method: Direct Mapping
Maintain an array (bit vector) for all possible keys
 insert (i): set A[i] = 1
 find (i): return A[i]
 remove (i): set A[i] = 0
[Figure: a bit vector indexed by Perm #; bits 1, 2, 3, 8, 9, 13, 14 are set for the graduated students.]
Another Naïve Method: Direct Mapping
Maintain an array (bit vector) for all possible keys
 insert (i): set A[i] = 1
 find (i): return A[i]
 remove (i): set A[i] = 0
All operations easy and fast O(1)
What’s the drawback?
Too much memory/space, and wasteful!
The space of all possible IP addresses, or of all variable names in a compiler, is enormous!
Dictionary ADT: Naïve Implementations
Direct mapping: O(1) time possible, but space-inefficient.
Linked list: space-efficient, but search-inefficient.
 Insert is O(1), but find and delete are O(n).
A sorted array does not help, even with ordered keys: search becomes fast, but insert/delete take O(n).
Balanced search trees (Chap. 4) work, but take O(log n) time per operation and are complicated.
Towards an Efficient Data Structure: Hash Table
Formal Setup
 The keys to be managed come from a known but very
large set, called universe U
 We can assume keys are integers {0, 1, …, |U|-1}
 Non-numeric keys (strings, webpages) are converted to numbers, e.g., the sum of ASCII values, or the first three characters
The set of keys to be managed is S, a subset of U.
The size of S is much smaller than that of U, namely |S| << |U|.
We use n for |S|.
Hash Table
Hash Tables use a hash function h to map each input key to a unique location in a table of size M:
 h : U -> {0, 1, …, M-1}
 The hash function determines the hash table size.
Desiderata:
 M should be small, O(n)
 h should be easy to compute
 Typical example: h(i) = i mod M
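A minimal sketch of the typical example above, assuming non-negative integer keys:

#include <cstdint>

// Typical hash function from the slide: h(i) = i mod M.
// M is the table size; a prime M (e.g., 11) helps spread keys.
inline std::size_t h(std::uint64_t key, std::size_t M) {
    return key % M;   // maps the universe into {0, 1, ..., M-1}
}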
Hashing: the basic idea
[Figure: student records hashed into a small table by Perm # (mod 9); e.g., keys 9, 10, 20, 39, 4, 14, 8 land in slots 0, 1, 2, 3, 4, 5, 8.]
Hash Tables: Intuition
Unique location lets us find an item in O(1) time.
 Each item is uniquely identified by a key
Just check the location h(key) to find the item
What can go wrong?
Suppose we expect to have at most 100 keys in S
 91, 2048, 329, 17, 689345, ….
We create a table of size 100 and use the hash
function h(key) = key mod 100
It is both fast and uses a table of the ideal size.
Hashing
But what if all keys end with 00?
 All keys will map to the same location
 This is called a Collision in Hashing
This motivates the 3rd important property of hashing:
 A good hash function should evenly spread the keys to foil any special structure in the input.
 Hashing with mod 100 works fine if the keys are random.
 Most data (e.g. program variables) are not random.
Hashing
A good hash function should evenly spread the keys to foil any special structure in the input.
The key idea behind hashing is to "simulate" randomness through the hash function.
A good choice is h(x) = x mod p, for prime p.
Functions h(x) = (ax + b) mod p are called pseudo-random hash functions.
Hashing: The Basic Setup
Choose a pseudo-random hash function h;
 this automatically determines the hash table size.
An item with key k is put at location h(k).
To find an item with key k, check location h(k).
What to do if more than one key hashes to the same value? This is called a collision.
We will discuss two methods to handle collisions:
 Separate chaining
 Open addressing
Separate chaining
Maintain a list of all elements that hash to the same value.
Search: use the hash function to determine which list to traverse.
Insert/delete: once the "bucket" is found through the hash, insert and delete are list operations.
class HashTable {
    ...
private:
    unsigned int Hsize;      // number of buckets
    List<E,K>* TheList;      // array of Hsize chains
    ...
};

// find: hash the key, then search only that bucket's list
bool HashTable<E,K>::find(const K& k, E& e)
{
    int HashVal = Hash(k, Hsize);
    return TheList[HashVal].Search(k, e);
}
[Figure: a chained table with 11 buckets (0-10); e.g., keys 42, 20, 31 share bucket 9 and keys 1, 56, 23 share bucket 1.]
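The fragment above is schematic. Below is a self-contained sketch of the same idea using the standard library; the class name and int keys are illustrative assumptions:

#include <list>
#include <vector>

// Separate chaining: one linked list ("chain") per bucket.
// Assumes non-negative integer keys.
class ChainedHashTable {
    std::vector<std::list<int>> buckets;
public:
    explicit ChainedHashTable(std::size_t M) : buckets(M) {}

    void insert(int key) {                  // O(1): prepend to the chain
        buckets[key % buckets.size()].push_front(key);
    }
    bool find(int key) const {              // cost: 1 + chain length
        for (int k : buckets[key % buckets.size()])
            if (k == key) return true;
        return false;
    }
    void remove(int key) {                  // list deletion once the bucket is found
        buckets[key % buckets.size()].remove(key);
    }
};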
Insertion: insert 53
53 = 4 x 11 + 9, so 53 mod 11 = 9.
[Figure: 53 is added to the chain in bucket 9, joining 42, 20, and 31.]
Analysis of Hashing with Chaining
Worst case
 All keys hash into the same bucket: a single linked list.
 insert, delete, find take O(n) time.
 (A worst-case theorem comes later.)
Average case
 Keys are uniformly distributed among the buckets.
 Load factor L = InputSize/HashTableSize = n/M.
 In a failed search, the avg cost is L.
 In a successful search, the avg cost is 1 + L/2.
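As a worked instance of these formulas (the numbers are chosen for illustration, not from the slides): storing n = 100 keys in M = 50 buckets gives

\[
L = \frac{n}{M} = \frac{100}{50} = 2,
\qquad
\text{failed search} \approx L = 2,
\qquad
\text{successful search} \approx 1 + \frac{L}{2} = 2 .
\]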
Open addressing
If a collision happens, alternative cells are tried until an empty cell is found.
Linear probing: try the next available position.
[Figure: an open-addressing table of size 11 holding 42, 9, 14, 1, 16, 24, 31, 28, 7; colliding keys sit in the next free slot.]
Linear Probing (insert 12)
12 = 1 x 11 + 1, so 12 mod 11 = 1.
[Figure: slot 1 is occupied (by 1), so 12 probes forward past 24 and 14 and lands in the first empty slot, slot 4.]
Search with linear probing (Search 15)
15 = 1 x 11 + 4, so 15 mod 11 = 4.
[Figure: the probe starts at slot 4 and walks forward until it reaches an empty slot without seeing 15.]
NOT FOUND!
Search with linear probing
// find the slot where searched item should be in
int HashTable<E,K>::hSearch(const K& k) const
{
int HashVal = k % D;
int j = HashVal;
do {// don’t search past the first empty slot (insert should put it there)
if (empty[j] || ht[j] == k) return j;
j = (j + 1) % D;
} while (j != HashVal);
return j; // no empty slot and no match either, give up
}
bool HashTable<E,K>::find(const K& k, E& e) const
{
int b = hSearch(k);
if (empty[b] || ht[b] != k) return false;
e = ht[b];
return true;
}
Deletion in Hashing with Linear Probing
Since empty buckets are used to terminate search,
standard deletion does not work.
One simple idea is to not delete, but mark.
 Insert: put item in first empty or marked bucket.
 Search: Continue past marked buckets.
 Delete: just mark the bucket as deleted.
Advantage: Easy and correct.
Disadvantage: table can become full with dead items.
Avg. cost of a successful search: ½ (1 + 1/(1 – L)).
A failed search costs more: ½ (1 + 1/(1 – L)²).
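A minimal sketch of this mark-don't-delete ("tombstone") scheme with linear probing; the class name, int keys, and slot states are illustrative, and the sketch assumes the table never becomes completely full:

#include <vector>

// Lazy deletion: deleted slots become tombstones that searches
// skip over but insertions may reuse. Non-negative keys assumed.
class ProbingTable {
    enum class Slot { Empty, Full, Deleted };
    std::vector<int>  ht;
    std::vector<Slot> state;
    std::size_t home(int key) const { return key % ht.size(); }
public:
    explicit ProbingTable(std::size_t M) : ht(M), state(M, Slot::Empty) {}

    void insert(int key) {                        // first empty OR marked bucket
        std::size_t i = home(key);
        while (state[i] == Slot::Full) i = (i + 1) % ht.size();
        ht[i] = key; state[i] = Slot::Full;
    }
    bool find(int key) const {                    // continue past marked buckets
        for (std::size_t i = home(key); state[i] != Slot::Empty; i = (i + 1) % ht.size())
            if (state[i] == Slot::Full && ht[i] == key) return true;
        return false;                             // empty bucket ends the search
    }
    void remove(int key) {                        // just mark the bucket as deleted
        for (std::size_t i = home(key); state[i] != Slot::Empty; i = (i + 1) % ht.size())
            if (state[i] == Slot::Full && ht[i] == key) { state[i] = Slot::Deleted; return; }
    }
};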
Deletion with linear probing: LAZY (Delete 9)
9 = 0 x 11 + 9, so 9 mod 11 = 9.
[Figure: slot 9 holds 42, so the probe moves on and finds 9 in slot 10: FOUND! The slot is then marked D (deleted) rather than emptied.]
Eager Deletion: fill holes
Remove and find a replacement:
 fill in the hole for later searches.
remove(j)
{
    i = j;
    empty[j] = true;                 // open the hole at j
    i = (i + 1) % D;                 // candidate slot for swapping
    while ((not empty[i]) and i != j) {
        r = Hash(ht[i]);             // home slot: where ht[i] would go without collisions
        // Can ht[i] still be found if it is moved into slot j?
        // It can, unless its home slot r lies cyclically in (j, i].
        if (not ((j<r<=i) or (i<j<r) or (r<=i<j)))
            then break;              // safe replacement found: swap it into j
        i = (i + 1) % D;             // unsafe to move; try the next occupied slot
    }
    if (i != j and not empty[i])
    then {
        ht[j] = ht[i];               // fill the hole at j
        remove(i);                   // recursively fill the new hole at i
    }
}
Eager Deletion Analysis (cont.)
 If the table is not full:
 After a deletion, there will be at least two holes.
 The elements affected by the new hole are those whose
  initial hashed location is cyclically before the new hole, and whose
  location after linear probing is between the new hole and the next hole in the search order.
 Such elements are movable to fill the hole.
[Figure: a new hole and the next hole in the search order, with the affected elements' initial hashed locations and their locations after linear probing in between.]
Eager Deletion Analysis (cont.)
The important thing is to make sure that if a replacement (i) is swapped into the deleted slot (j), we can still find that element. How can we fail to find it?
 If its original hashed position (r) is cyclically in between the deleted slot and the replacement.
[Figure: the cyclic orderings of j, r, i. When r lies cyclically in (j, i], the search for the moved element stops at the empty slot and will not find i; otherwise it will find i.]
Quadratic Probing
Solves the clustering problem in Linear Probing:
 Check H(x)
 If a collision occurs, check H(x) + 1
 If a collision occurs, check H(x) + 4
 If a collision occurs, check H(x) + 9
 If a collision occurs, check H(x) + 16
 ...
 In general: H(x) + i²
Quadratic Probing (insert 12)
12 = 1 x 11 + 1, so 12 mod 11 = 1.
[Figure: slot 1 is occupied, so 12 tries offsets 1, 4, 9, … from slot 1 (mod 11) until it reaches a free slot.]
Double Hashing
When a collision occurs, use a second hash function:
 Hash2(x) = R – (x mod R)
 R: greatest prime number smaller than the table size
Inserting 12:
 H2(12) = 7 – (12 mod 7) = 2
 Check H(x)
 If a collision occurs, check H(x) + 2
 If a collision occurs, check H(x) + 4
 If a collision occurs, check H(x) + 6
 If a collision occurs, check H(x) + 8
 In general: H(x) + i * H2(x)
Double Hashing (insert 12)
12 mod 11 = 1 and H2(12) = 7 – (12 mod 7) = 2.
[Figure: slot 1 is occupied, so 12 probes slots 3, 5, 7, 9, … in steps of H2(12) = 2 until it reaches a free slot.]
Rehashing
If the table gets too full, operations will take too long.
Build another table, twice as big (and prime in size).
 The next prime number after 11 x 2 is 23.
Insert every element again into this table.
Rehash after a percentage of the table becomes full (70%, for example).
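A sketch of the rehashing step for a chained table; next_prime is a helper written out here for completeness:

#include <list>
#include <vector>

// Smallest prime >= n, by trial division (fine at table-size scale).
static std::size_t next_prime(std::size_t n) {
    auto is_prime = [](std::size_t x) {
        if (x < 2) return false;
        for (std::size_t d = 2; d * d <= x; ++d)
            if (x % d == 0) return false;
        return true;
    };
    while (!is_prime(n)) ++n;
    return n;                          // e.g., next_prime(22) == 23
}

// Build a table about twice as big (and prime), then insert
// every element again so it lands in its new bucket.
void rehash(std::vector<std::list<int>>& buckets) {
    std::vector<std::list<int>> bigger(next_prime(2 * buckets.size()));
    for (const auto& chain : buckets)
        for (int key : chain)
            bigger[key % bigger.size()].push_back(key);
    buckets.swap(bigger);
}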
Collision Functions
Hi(x) = (H(x) + i) mod B
 Linear probing
Hi(x) = (H(x) + c*i) mod B (c > 1)
 Linear probing with step-size = c
Hi(x) = (H(x) + i²) mod B
 Quadratic probing
Hi(x) = (H(x) + i * H2(x)) mod B
 Double hashing
(A sketch of all four follows.)
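The four schemes differ only in the step added on the i-th attempt; a compact sketch (the function names are illustrative):

#include <cstddef>

// i-th probe location under each collision function above.
// H is the primary hash value H(x), B the table size, i the attempt.
std::size_t linear(std::size_t H, std::size_t i, std::size_t B) {
    return (H + i) % B;                    // Hi(x) = (H(x) + i) mod B
}
std::size_t linear_step(std::size_t H, std::size_t i, std::size_t B, std::size_t c) {
    return (H + c * i) % B;                // step size c > 1
}
std::size_t quadratic(std::size_t H, std::size_t i, std::size_t B) {
    return (H + i * i) % B;                // Hi(x) = (H(x) + i²) mod B
}
std::size_t double_hash(std::size_t H, std::size_t i, std::size_t B, std::size_t H2) {
    return (H + i * H2) % B;               // H2 = R - (x mod R), R prime < B
}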
Analysis of Open Hashing
Effort of one Insert?
 Intuitively, it depends on how full the hash table is.
Effort of an average Insert?
Effort to fill the bucket to a certain capacity?
 Intuitively, the accumulated effort of the inserts.
Effort to search for an item (both successful and unsuccessful)?
Effort to delete an item (both successful and unsuccessful)?
 Same effort for successful search and delete?
 Same effort for unsuccessful search and delete?
Issues
What do we lose?
 Operations that require ordering are inefficient:
  FindMax: O(n) for hashing vs. O(log n) for a balanced binary tree
  FindMin: O(n) vs. O(log n) for a balanced binary tree
  PrintSorted: O(n log n) vs. O(n) for a balanced binary tree
What do we gain?
 Insert: O(1) vs. O(log n) for a balanced binary tree
 Delete: O(1) vs. O(log n) for a balanced binary tree
 Find: O(1) vs. O(log n) for a balanced binary tree
How to handle collisions?
 Separate chaining
 Open addressing
Theory of Hashing
First the bad news.
Theorem: For any hash function h: U -> {0, 1, …, M-1}, there exists a set S of n keys that all map to the same location, assuming |U| > nM.
So, in the worst case no hash function can avoid linear search complexity!
Proof.
Take any hash function h you wish to consider.
Map all the keys of U using h to the table of size M.
By the pigeonhole principle, at least one table entry receives at least n keys.
Choose those n keys as the input set S.
Now h maps the entire set S to a single location: a worst-case example of hashing.
Theory of Hashing
The negative result says that given a fixed hash function h, one can always construct a set S that is bad for h.
However, what we desire is something different:
We are not choosing S; it is our (given) input.
Can we find a good h for this particular S?
Theory shows that a random choice of h works.
Theory of Hashing: Birthday Paradox
To appreciate the subtlety of hashing, first consider a puzzle: the birthday paradox.
Suppose birthdays are chance events:
 date of birth is purely random
 any day of the year is just as likely as another
Theory of Hashing: Birthday Paradox
What are the chances that in a group of 30 people, at least two have the same birthday?
How many people are needed to have at least a 50% chance of a shared birthday?
It's called a paradox because the answer appears to be counter-intuitive.
There are 365 different birthdays, so for a 50% chance you might expect to need at least 182 people.
Birthday Paradox: the math
Suppose 2 people are in the room.
What is the prob. that they have the same birthday?
The answer is 1/365:
all birthdays are equally likely, so B's birthday falls on A's birthday 1 in 365 times.
Now suppose there are k people in the room.
It's more convenient to calculate the prob. X that no two have the same birthday.
Our answer will then be 1 – X.
Birthday Paradox
Define Pi = prob. that the first i people all have distinct birthdays.
For convenience, define p = 1/365.
 P1 = 1
 P2 = (1 – p)
 P3 = (1 – p)(1 – 2p)
 Pk = (1 – p)(1 – 2p) … (1 – (k–1)p)
You can now verify that for k = 23, Pk <= 0.4999.
That is, with just 23 people in the room, there is more than a 50% chance that two have the same birthday (see the check below).
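A throwaway C++ check of the k = 23 claim (the program is illustrative, not from the slides):

#include <cstdio>

// P_k = (1 - p)(1 - 2p)...(1 - (k-1)p), with p = 1/365 and k = 23.
int main() {
    const double p = 1.0 / 365.0;
    double P = 1.0;                     // P_1 = 1
    for (int j = 1; j < 23; ++j)
        P *= 1.0 - j * p;
    std::printf("P_23 = %.4f\n", P);    // prints P_23 = 0.4927, i.e. below 0.5
}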
Birthday Paradox: derivation
Use 1 – x <= e^(–x), for all x.
Therefore, 1 – j*p <= e^(–jp).
Also, e^x * e^y = e^(x+y).
Therefore, Pk <= e^(–p – 2p – 3p – … – (k–1)p) = e^(–k(k–1)p/2).
For k = 23, we have k(k–1)/(2 x 365) = 0.69.
e^(–0.69) ≈ 0.5, so Pk is at most about 0.5.
Connection to Hashing:
Suppose n = 23, and the hash table has size M = 365.
There is a 50% chance that 2 keys will land in the same bucket.
Theory of Hashing: Universal Hash Functions
A set of hash functions H is called universal if, for a hash function h chosen randomly from it,
 Prob[h(x) = h(y)] <= 1/M, for any distinct x, y in U.
Theorem. Suppose H is universal, S is an n-element subset of U, and h is a random hash function from H.
Then for any x in S, the expected number of keys colliding with x is at most (n-1)/M.
Theory of Hashing: Universal Hash Functions
Theorem. Suppose H is universal, S is an n-element subset of U, and h is a random hash function from H.
Then for any x in S, the expected number of keys colliding with x is at most (n-1)/M.
Proof.
Consider any x in S. For any other y, the prob. that h(y) = h(x) is at most 1/M (by universality).
By linearity of expectation, the expected number of other keys mapping to h(x) is at most (n-1)/M.
Corollary. By using a random hash function (from a universal family), we get expected search time O(1 + n/M).
Universal hash functions exist: modulo a prime is an example, but that is not proved here.
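A sketch of one such family, h_{a,b}(x) = ((a*x + b) mod p) mod M with p prime and a, b chosen at random; this is the classic Carter-Wegman construction, and the specific prime and the use of __int128 (gcc/clang) are implementation assumptions:

#include <cstdint>
#include <random>

// One hash function drawn at random from the universal family
// h_{a,b}(x) = ((a*x + b) mod p) mod M, with prime p > every key.
class UniversalHash {
    std::uint64_t a, b, p, M;
public:
    explicit UniversalHash(std::uint64_t M)
        : p(2305843009213693951ULL /* the prime 2^61 - 1 */), M(M) {
        std::mt19937_64 gen{std::random_device{}()};
        a = std::uniform_int_distribution<std::uint64_t>(1, p - 1)(gen);
        b = std::uniform_int_distribution<std::uint64_t>(0, p - 1)(gen);
    }
    std::uint64_t operator()(std::uint64_t x) const {
        // __int128 avoids overflow when computing a*x + b.
        return (std::uint64_t)(((unsigned __int128)a * x + b) % p) % M;
    }
};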
Constructing Universal Hash Functions
Universal Hash Functions by Dot Products
Proof
A Fact from Number Theory
Proof (cont.)
Perfect Hashing: Worst-Case O(1) Lookup
Universal hashing assures us that hashing has expected O(1) search time, assuming n/M is at most a constant.
But what about the worst case?
There remains a small, but non-zero, prob. of an unlucky random draw.
A more sophisticated theory of Perfect Hashing shows that one can even achieve an O(1) worst-case result, using a 2-level hashing table.
Fredman-Komlos-Szemeredi [JACM 1984]
Collisions at Level 2
Achieving Zero Collisions at Level 2
Analysis of Space Complexity
Bloom Filters
In some applications, we need a very compact data structure for quick membership tests: e.g., a table of weak passwords.
We are not interested in the passwords themselves, so there is no need to store the keys explicitly (as hash tables do).
Bloom Filters are a highly space-efficient data structure for this kind of finger-printing.
In other words: how compact a table will suffice if we just want a quick test for "Is x in S?"
A Motivating Application
Web Caching
 An ISP keeps several levels of caches for fast access.
 Upon a client's request for data (image, movie, etc.):
 Check if the data is in the local cache. If so, serve it from the cache.
 Otherwise, fetch the data from the remote server.
 Remote server access is several orders of magnitude slower.
 Local access is therefore hugely preferable.
 In fact, even if an occasional false positive occurs, the extra penalty of checking the local cache is negligible.
Bloom Filters vs. Hashing
Bloom Filters sacrifice correctness for space efficiency:
 If a key is present, they always find it.
 But they may say Yes when in fact the key is not present:
 the false positives problem.
They can also be thought of as an extension of hashing with an interesting space/error-rate tradeoff.
 Universal hashing gets its power from choosing the hash function at random.
 Randomness is an aid to foil an adversarial choice of keys.
 Perfect Hashing shows this can be achieved even in the worst case, but at the expense of added complexity.
 An alternative: apply multiple hash functions to each key.
 This allows the use of simple hash functions,
 but minimizes the risk of relying on a single hash function.
Bloom Filter: formal setup
Store an n-element set S from a large universe U.
 n = |S| << |U|
 Think of U as all possible web pages, and S as the set maintained in the cache.
We want to support "membership queries":
 Is a given element x currently in the set S?
 If the data structure returns No, then x is definitely not in S.
 But the data structure can say Yes even if x is not in S, though only with small probability.
 Membership and Insert operations should take O(1) time.
 Delete can be handled as well.
Bloom Filters: Details
A Bloom filter is a bit vector B of m bits.
Each key is mapped into B using k independent hash functions.
The number of hash functions k is an optimization parameter.
To insert x into S:
 Compute h1(x), h2(x), …, hk(x).
 Set B[hi(x)] = 1, for i = 1, 2, …, k.
To check for membership:
 Compute h1(x), h2(x), …, hk(x).
 Answer Yes if B[hi(x)] = 1 for all i = 1, 2, …, k;
 otherwise answer No.
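A compact sketch of this scheme; deriving the k functions as h1 + i*h2 is a common implementation shortcut and an assumption here, not the slides' construction (m > 1 assumed):

#include <cstdint>
#include <vector>

// Bloom filter: an m-bit vector B and k hash functions per key.
class BloomFilter {
    std::vector<bool> bits;
    std::size_t k;
    std::size_t h1(std::uint64_t x) const { return x % bits.size(); }
    std::size_t h2(std::uint64_t x) const { return 1 + x % (bits.size() - 1); }
public:
    BloomFilter(std::size_t m, std::size_t k) : bits(m), k(k) {}

    void insert(std::uint64_t x) {                // set B[h_i(x)] = 1, i = 1..k
        for (std::size_t i = 0; i < k; ++i)
            bits[(h1(x) + i * h2(x)) % bits.size()] = true;
    }
    bool contains(std::uint64_t x) const {        // Yes iff all k bits are 1
        for (std::size_t i = 0; i < k; ++i)
            if (!bits[(h1(x) + i * h2(x)) % bits.size()]) return false;
        return true;                              // a Yes may be a false positive
    }
};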
Bloom Filters: an example
Bloom Filters: analysis
 Prob. that a given bit remains unset (0) is p.
 Prob. that some non-member y gets flagged as present:
 all k hash entries for y are set to 1:
 (1 – p)^k ≈ (1 – e^(–kn/m))^k
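Spelled out in LaTeX (standard Bloom filter analysis; the optimal-k value is a well-known result, stated here for completeness):

\[
p = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m},
\qquad
f = (1 - p)^k \approx \left(1 - e^{-kn/m}\right)^k,
\]
\[
\text{minimized at } k = \frac{m}{n}\ln 2, \text{ giving } f \approx (0.6185)^{m/n}.
\]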
Bloom Filters vs. Hashing
 Bloom Filters use multiple hash functions and create a k-bit finger-print for each input key.
 If we store an n-key set in a table of size m, BF analysis tells us the optimal choice of k and the resulting error rate.
 Why is this better than a simple hash table of size m? Let's compare.
 A hash table gives a false positive when a collision occurs.
 The prob. of a collision is 1 – (1 – 1/m)^n, which is approx. 1 – e^(–n/m).
Bloom Filter vs. Hash Tables
More Related Content

What's hot

Open addressing &amp rehashing,extendable hashing
Open addressing &amp rehashing,extendable hashingOpen addressing &amp rehashing,extendable hashing
Open addressing &amp rehashing,extendable hashing
Haripritha
 

What's hot (20)

Hashing
HashingHashing
Hashing
 
Hashing
HashingHashing
Hashing
 
Hashing
HashingHashing
Hashing
 
Open addressing &amp rehashing,extendable hashing
Open addressing &amp rehashing,extendable hashingOpen addressing &amp rehashing,extendable hashing
Open addressing &amp rehashing,extendable hashing
 
Open Addressing on Hash Tables
Open Addressing on Hash Tables Open Addressing on Hash Tables
Open Addressing on Hash Tables
 
Hashing Algorithm
Hashing AlgorithmHashing Algorithm
Hashing Algorithm
 
Hashing notes data structures (HASHING AND HASH FUNCTIONS)
Hashing notes data structures (HASHING AND HASH FUNCTIONS)Hashing notes data structures (HASHING AND HASH FUNCTIONS)
Hashing notes data structures (HASHING AND HASH FUNCTIONS)
 
08 Hash Tables
08 Hash Tables08 Hash Tables
08 Hash Tables
 
Hashing PPT
Hashing PPTHashing PPT
Hashing PPT
 
Hashing data
Hashing dataHashing data
Hashing data
 
Hashing
HashingHashing
Hashing
 
linear probing
linear probinglinear probing
linear probing
 
Data Structure and Algorithms Hashing
Data Structure and Algorithms HashingData Structure and Algorithms Hashing
Data Structure and Algorithms Hashing
 
Hash table in data structure and algorithm
Hash table in data structure and algorithmHash table in data structure and algorithm
Hash table in data structure and algorithm
 
Open addressiing &amp;rehashing,extendiblevhashing
Open addressiing &amp;rehashing,extendiblevhashingOpen addressiing &amp;rehashing,extendiblevhashing
Open addressiing &amp;rehashing,extendiblevhashing
 
Ch17 Hashing
Ch17 HashingCh17 Hashing
Ch17 Hashing
 
Hashing
HashingHashing
Hashing
 
Hash Tables in data Structure
Hash Tables in data StructureHash Tables in data Structure
Hash Tables in data Structure
 
Hashing
HashingHashing
Hashing
 
Hashing Technique In Data Structures
Hashing Technique In Data StructuresHashing Technique In Data Structures
Hashing Technique In Data Structures
 

Viewers also liked

nhận thiết kế clip quảng cáo giá tốt
nhận thiết kế clip quảng cáo giá tốtnhận thiết kế clip quảng cáo giá tốt
nhận thiết kế clip quảng cáo giá tốt
raul110
 
문제는 한글이 잘 구현되는가?
문제는 한글이 잘 구현되는가?문제는 한글이 잘 구현되는가?
문제는 한글이 잘 구현되는가?
Choi Man Dream
 
CV Belinda Wahl 2015
CV Belinda Wahl 2015CV Belinda Wahl 2015
CV Belinda Wahl 2015
Belinda Wahl
 

Viewers also liked (20)

3.8 quicksort 04
3.8 quicksort 043.8 quicksort 04
3.8 quicksort 04
 
trabajo de cultural
trabajo de culturaltrabajo de cultural
trabajo de cultural
 
RESUME-ARITRA BHOWMIK
RESUME-ARITRA BHOWMIKRESUME-ARITRA BHOWMIK
RESUME-ARITRA BHOWMIK
 
2.4 mst prim &kruskal demo
2.4 mst  prim &kruskal demo2.4 mst  prim &kruskal demo
2.4 mst prim &kruskal demo
 
Top Forex Brokers
Top Forex BrokersTop Forex Brokers
Top Forex Brokers
 
160607 14 sw교육_강의안
160607 14 sw교육_강의안160607 14 sw교육_강의안
160607 14 sw교육_강의안
 
nhận thiết kế clip quảng cáo giá tốt
nhận thiết kế clip quảng cáo giá tốtnhận thiết kế clip quảng cáo giá tốt
nhận thiết kế clip quảng cáo giá tốt
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructures
 
문제는 한글이 잘 구현되는가?
문제는 한글이 잘 구현되는가?문제는 한글이 잘 구현되는가?
문제는 한글이 잘 구현되는가?
 
평범한 이야기[Intro: 2015 의기제]
평범한 이야기[Intro: 2015 의기제]평범한 이야기[Intro: 2015 의기제]
평범한 이야기[Intro: 2015 의기제]
 
4.2 bst 03
4.2 bst 034.2 bst 03
4.2 bst 03
 
5.3 dyn algo-i
5.3 dyn algo-i5.3 dyn algo-i
5.3 dyn algo-i
 
4.1 webminig
4.1 webminig 4.1 webminig
4.1 webminig
 
1.9 b trees eg 03
1.9 b trees eg 031.9 b trees eg 03
1.9 b trees eg 03
 
2.4 mst kruskal’s
2.4 mst  kruskal’s 2.4 mst  kruskal’s
2.4 mst kruskal’s
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03
 
5.4 randamized algorithm
5.4 randamized algorithm5.4 randamized algorithm
5.4 randamized algorithm
 
1.9 b trees 02
1.9 b trees 021.9 b trees 02
1.9 b trees 02
 
CV Belinda Wahl 2015
CV Belinda Wahl 2015CV Belinda Wahl 2015
CV Belinda Wahl 2015
 
Salario minimo basico
Salario minimo basicoSalario minimo basico
Salario minimo basico
 

Similar to 4.4 hashing02

Algorithm chapter 7
Algorithm chapter 7Algorithm chapter 7
Algorithm chapter 7
chidabdu
 
computer notes - Data Structures - 36
computer notes - Data Structures - 36computer notes - Data Structures - 36
computer notes - Data Structures - 36
ecomputernotes
 
Advance algorithm hashing lec II
Advance algorithm hashing lec IIAdvance algorithm hashing lec II
Advance algorithm hashing lec II
Sajid Marwat
 

Similar to 4.4 hashing02 (20)

Hash function
Hash functionHash function
Hash function
 
session 15 hashing.pptx
session 15   hashing.pptxsession 15   hashing.pptx
session 15 hashing.pptx
 
Algorithms notes tutorials duniya
Algorithms notes   tutorials duniyaAlgorithms notes   tutorials duniya
Algorithms notes tutorials duniya
 
Hashing In Data Structure Download PPT i
Hashing In Data Structure Download PPT iHashing In Data Structure Download PPT i
Hashing In Data Structure Download PPT i
 
Presentation.pptx
Presentation.pptxPresentation.pptx
Presentation.pptx
 
Monads and Monoids by Oleksiy Dyagilev
Monads and Monoids by Oleksiy DyagilevMonads and Monoids by Oleksiy Dyagilev
Monads and Monoids by Oleksiy Dyagilev
 
Hashing
HashingHashing
Hashing
 
Hashing.pptx
Hashing.pptxHashing.pptx
Hashing.pptx
 
Lec5
Lec5Lec5
Lec5
 
Algorithm chapter 7
Algorithm chapter 7Algorithm chapter 7
Algorithm chapter 7
 
18. Dictionaries, Hash-Tables and Set
18. Dictionaries, Hash-Tables and Set18. Dictionaries, Hash-Tables and Set
18. Dictionaries, Hash-Tables and Set
 
computer notes - Data Structures - 36
computer notes - Data Structures - 36computer notes - Data Structures - 36
computer notes - Data Structures - 36
 
Radix and shell sort
Radix and shell sortRadix and shell sort
Radix and shell sort
 
Hash presentation
Hash presentationHash presentation
Hash presentation
 
Advance algorithm hashing lec II
Advance algorithm hashing lec IIAdvance algorithm hashing lec II
Advance algorithm hashing lec II
 
Class 9: Consistent Hashing
Class 9: Consistent HashingClass 9: Consistent Hashing
Class 9: Consistent Hashing
 
03.01 hash tables
03.01 hash tables03.01 hash tables
03.01 hash tables
 
DA_02_algorithms.pptx
DA_02_algorithms.pptxDA_02_algorithms.pptx
DA_02_algorithms.pptx
 
hashing.pdf
hashing.pdfhashing.pdf
hashing.pdf
 
11_hashtable-1.ppt. Data structure algorithm
11_hashtable-1.ppt. Data structure algorithm11_hashtable-1.ppt. Data structure algorithm
11_hashtable-1.ppt. Data structure algorithm
 

More from Krish_ver2

More from Krish_ver2 (20)

5.5 back tracking
5.5 back tracking5.5 back tracking
5.5 back tracking
 
5.5 back track
5.5 back track5.5 back track
5.5 back track
 
5.5 back tracking 02
5.5 back tracking 025.5 back tracking 02
5.5 back tracking 02
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructures
 
5.3 dynamic programming 03
5.3 dynamic programming 035.3 dynamic programming 03
5.3 dynamic programming 03
 
5.3 dynamic programming
5.3 dynamic programming5.3 dynamic programming
5.3 dynamic programming
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03
 
5.2 divide and conquer
5.2 divide and conquer5.2 divide and conquer
5.2 divide and conquer
 
5.1 greedyyy 02
5.1 greedyyy 025.1 greedyyy 02
5.1 greedyyy 02
 
5.1 greedy
5.1 greedy5.1 greedy
5.1 greedy
 
5.1 greedy 03
5.1 greedy 035.1 greedy 03
5.1 greedy 03
 
4.4 hashing ext
4.4 hashing  ext4.4 hashing  ext
4.4 hashing ext
 
4.4 external hashing
4.4 external hashing4.4 external hashing
4.4 external hashing
 
4.2 bst
4.2 bst4.2 bst
4.2 bst
 
4.2 bst 02
4.2 bst 024.2 bst 02
4.2 bst 02
 
4.1 sequentioal search
4.1 sequentioal search4.1 sequentioal search
4.1 sequentioal search
 
3.9 external sorting
3.9 external sorting3.9 external sorting
3.9 external sorting
 
3.8 quicksort
3.8 quicksort3.8 quicksort
3.8 quicksort
 
3.8 quick sort
3.8 quick sort3.8 quick sort
3.8 quick sort
 
3.7 heap sort
3.7 heap sort3.7 heap sort
3.7 heap sort
 

Recently uploaded

Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
EADTU
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 
QUATER-1-PE-HEALTH-LC2- this is just a sample of unpacked lesson
QUATER-1-PE-HEALTH-LC2- this is just a sample of unpacked lessonQUATER-1-PE-HEALTH-LC2- this is just a sample of unpacked lesson
QUATER-1-PE-HEALTH-LC2- this is just a sample of unpacked lesson
httgc7rh9c
 

Recently uploaded (20)

Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
 
How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptx
 
OS-operating systems- ch05 (CPU Scheduling) ...
OS-operating systems- ch05 (CPU Scheduling) ...OS-operating systems- ch05 (CPU Scheduling) ...
OS-operating systems- ch05 (CPU Scheduling) ...
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdfFICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
 
Simple, Complex, and Compound Sentences Exercises.pdf
Simple, Complex, and Compound Sentences Exercises.pdfSimple, Complex, and Compound Sentences Exercises.pdf
Simple, Complex, and Compound Sentences Exercises.pdf
 
Details on CBSE Compartment Exam.pptx1111
Details on CBSE Compartment Exam.pptx1111Details on CBSE Compartment Exam.pptx1111
Details on CBSE Compartment Exam.pptx1111
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 
How to Add a Tool Tip to a Field in Odoo 17
How to Add a Tool Tip to a Field in Odoo 17How to Add a Tool Tip to a Field in Odoo 17
How to Add a Tool Tip to a Field in Odoo 17
 
Tatlong Kwento ni Lola basyang-1.pdf arts
Tatlong Kwento ni Lola basyang-1.pdf artsTatlong Kwento ni Lola basyang-1.pdf arts
Tatlong Kwento ni Lola basyang-1.pdf arts
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
QUATER-1-PE-HEALTH-LC2- this is just a sample of unpacked lesson
QUATER-1-PE-HEALTH-LC2- this is just a sample of unpacked lessonQUATER-1-PE-HEALTH-LC2- this is just a sample of unpacked lesson
QUATER-1-PE-HEALTH-LC2- this is just a sample of unpacked lesson
 
OSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsOSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & Systems
 

4.4 hashing02

  • 1. 1 Course OutlineCourse Outline Introduction and Algorithm Analysis (Ch. 2) Hash Tables: dictionary data structure (Ch. 5) Heaps: priority queue data structures (Ch. 6) Balanced Search Trees: general search structures (Ch. 4.1-4.5) Union-Find data structure (Ch. 8.1–8.5) Graphs: Representations and basic algorithms  Topological Sort (Ch. 9.1-9.2)  Minimum spanning trees (Ch. 9.5)  Shortest-path algorithms (Ch. 9.3.2) B-Trees: External-Memory data structures (Ch. 4.7) kD-Trees: Multi-Dimensional data structures (Ch. 12.6) Misc.: Streaming data, randomization
  • 2. 2 Data Structures for SetsData Structures for Sets Many applications deal with sets.  Compilers have symbol tables (set of vars, classes)  IP routers have IP addresses, packet forwarding rules  Web servers have set of clients, etc.  Dictionary is a set of words. A set is a collection of members  No repetition of members  Members themselves can be sets Examples  {x | x is a positive integer and x < 100}  {x | x is a CA driver with > 10 years of driving experience and 0 accidents in the last 3 years}  All webpages related containing the word Algorithms
  • 3. 3 Abstract Data TypesAbstract Data Types Set + Operations define an ADT.  A set + insert, delete, find  A set + ordering  Multiple sets + union, insert, delete  Multiple sets + merge  Etc. Depending on type of members and choice of operations, different implementations can have different asymptotic complexity.
  • 4. 4 DictionaryDictionary ADTsADTs Data structure with just 3 basic operations:  find (i): find item with key i  insert (i): insert i into the dictionary  remove (i): delete i  Just like words in a Dictionary Where do we use them:  Symbol tables for compiler  Customer records (access by name)  Games (positions, configurations)  Spell checkers  P2P systems (access songs by name), etc.
  • 5. 5 Naïve Method: Linked ListNaïve Method: Linked List Keep a linked list of the keys  insert (i): add to the head of list. Easy and fast O(1)  find (i): worst-case, search the whole list (linear)  remove (i): also linear in worst-case
  • 6. 6 Another Naïve Method: Direct MappingAnother Naïve Method: Direct Mapping Maintain an array (bit vector) for all possible keys  insert (i): set A[i] = 1  find (i): return A[i]  remove (i): set A[i] = 0 Student Records 1 2 3 8 9 13 14 Graduates Perm #
  • 7. 7 Another Naïve Method: Direct MappingAnother Naïve Method: Direct Mapping Maintain an array (bit vector) for all possible keys  insert (i): set A[i] = 1  find (i): return A[i]  remove (i): set A[i] = 0 All operations easy and fast O(1) What’s the drawback? Too much memory/space, and wasteful! The space of all possible IP addresses, variable names in a compiler is enormous!
  • 8. 8 Dictionary ADT: Naïve ImplementationsDictionary ADT: Naïve Implementations O(1) time possible but space-inefficient. Linked list space-efficient, but search-inefficient.  Insert is O(1) but find and delete are O(n). A sorted array does not help, even with ordered keys. The search becomes fast, but insert/delete take O(n). Balanced search trees (Chap. 4) work but take O(log n) time per operation, and complicated.
  • 9. 9 Towards an Efficient Data Structure: Hash TableTowards an Efficient Data Structure: Hash Table Formal Setup  The keys to be managed come from a known but very large set, called universe U  We can assume keys are integers {0, 1, …, |U|}  Non-numeric keys (strings, webpages) converted to numbers: Sum of ASCII values, first three characters The set of keys to be managed is S, a subset of U. The size of S is much smaller than U, namely, |S| << |U| We use n for |S|.
  • 10. 10 Hash TableHash Table Hash Tables use a Hash Function h to map each input key to a unique location in table of size M  h : U -> {0, 1, …, M-1}  hash function determines the hash table size.hash function determines the hash table size. Desiderata:  M should be small, O(n)  h should be easy to compute  Typical example: h(i) = i mod M
  • 11. 11 Hashing : the basic ideaHashing : the basic idea 9 10 20 39 4 14 8 Graduates Perm # (mod 9) Student Records
  • 12. 12 Hash Tables: IntuitionHash Tables: Intuition Unique location lets us find an item in O(1) time.  Each item is uniquely identified by a key Just check the location h(key) to find the item What can go wrong? Suppose we expect to have at most 100 keys in S  91, 2048, 329, 17, 689345, …. We create a table of size 100 and use the hash function h(key) = key mod 100 It is both fast and uses the ideal size table.
  • 13. 13 Hashing:Hashing: But what if all keys end with 00?  All keys will map to the same location  This is called a Collision in Hashing This motivates the 3This motivates the 3rdrd important property of hashingimportant property of hashing  A good hash function should evenly spread theA good hash function should evenly spread the keys to foil any special structure of inputkeys to foil any special structure of input  Hashing with mod 100 works fine if keys randomHashing with mod 100 works fine if keys random  Most data (e.g. program variables) are not randomMost data (e.g. program variables) are not random
  • 14. 14 Hashing:Hashing: A good hash function should evenly spread theA good hash function should evenly spread the keys to foil any special structure of inputkeys to foil any special structure of input Key idea behind hashing is to “simulate” theKey idea behind hashing is to “simulate” the randomnessrandomness through the hash functionthrough the hash function A good choice isA good choice is h(x) = x mod ph(x) = x mod p, for prime p, for prime p h(x) = (ax + b) mod ph(x) = (ax + b) mod p called pseudo-random hashcalled pseudo-random hash functionsfunctions
  • 15. 15 Hashing: The Basic SetupHashing: The Basic Setup Choose a pseudo-random hash function hChoose a pseudo-random hash function h  this automatically determines the hash table size.this automatically determines the hash table size. An item with key k is put atAn item with key k is put at location h(k)location h(k).. To find an item with key k, check location h(k).To find an item with key k, check location h(k). What to doWhat to do if more than one keys hash to theif more than one keys hash to the samesame value. This is calledvalue. This is called collisioncollision.. We will discuss two methods to handle collision:We will discuss two methods to handle collision:  Separate chaining  Open addressing
  • 16. 16 Maintain a list of all elements that hash to the same value Search using the hash function to determine which list to traverse Insert/deletion–once the “bucket” is found through Hash, insert and delete are list operations Separate chainingSeparate chaining class HashTable { …… private: unsigned int Hsize; List<E,K> *TheList; …… find(k,e) HashVal = Hash(k,Hsize); if (TheList[HashVal].Search(k,e)) then return true; else return false; 14 42 29 20 1 36 5623 16 24 31 17 7 0 1 2 3 4 5 6 7 8 9 10
  • 17. 17 Insertion: insert 53Insertion: insert 53 14 42 29 20 1 36 5623 16 24 31 17 7 0 1 2 3 4 5 6 7 8 9 10 53 = 4 x 11 + 9 53 mod 11 = 9 14 42 29 20 1 36 5623 16 24 53 17 7 0 1 2 3 4 5 6 7 8 9 10 31
  • 18. 18 Analysis of Hashing with ChainingAnalysis of Hashing with Chaining Worst case  All keys hash into the same bucket  a single linked list.  insert, delete, find take O(n) time.  A worst-case Theorem later Average case  Keys are uniformly distributed into buckets  Load Factor L = InputSize/HashTableSize  In a failed search, avg cost is L  In a successful search, avg cost is 1 + L/2
  • 19. 19 Open addressingOpen addressing If collision happens, alternative cells are tried until an empty cell is found. Linear probing : Try next available position 0 1 2 3 4 5 6 7 8 9 10 42 9 14 1 16 24 31 28 7
  • 20. 20 Linear Probing (insert 12)Linear Probing (insert 12) 0 1 2 3 4 5 6 7 8 9 10 42 9 14 1 16 24 31 28 7 12 = 1 x 11 + 1 12 mod 11 = 1 0 1 2 3 4 5 6 7 8 9 10 42 9 14 1 16 24 31 28 7 12
  • 21. 21 Search with linear probing (Search 15)Search with linear probing (Search 15) 15 = 1 x 11 + 4 15 mod 11 = 4 0 1 2 3 4 5 6 7 8 9 10 42 9 14 1 16 24 31 28 7 12 NOT FOUND !
  • 22. 22 // find the slot where searched item should be in int HashTable<E,K>::hSearch(const K& k) const { int HashVal = k % D; int j = HashVal; do {// don’t search past the first empty slot (insert should put it there) if (empty[j] || ht[j] == k) return j; j = (j + 1) % D; } while (j != HashVal); return j; // no empty slot and no match either, give up } bool HashTable<E,K>::find(const K& k, E& e) const { int b = hSearch(k); if (empty[b] || ht[b] != k) return false; e = ht[b]; return true; } Search with linear probingSearch with linear probing
  • 23. 23 Deletion in Hashing with Linear ProbingDeletion in Hashing with Linear Probing Since empty buckets are used to terminate search, standard deletion does not work. One simple idea is to not delete, but mark.  Insert: put item in first empty or marked bucket.  Search: Continue past marked buckets.  Delete: just mark the bucket as deleted. Advantage: Easy and correct. Disadvantage: table can become full with dead items. Avg. cost for successful searches ½ (1 + 1/(1 – L)) Failed search avg. cost more ½ (1 + 1/(1 – L)2 )
  • 24. 24 Deletion with linear probing:Deletion with linear probing: LAZY (Delete 9)LAZY (Delete 9) 9 = 0 x 11 + 9 9 mod 11 = 9 0 1 2 3 4 5 6 7 8 9 10 42 9 14 1 16 24 31 28 7 12 FOUND ! 0 1 2 3 4 5 6 7 8 9 10 42 D 14 1 16 24 31 28 7 12
  • 25. 25 remove(j) { i = j; empty[i] = true; i = (i + 1) % D; // candidate for swapping while ((not empty[i]) and i!=j) { r = Hash(ht[i]); // where should it go without collision? // can we still find it based on the rehashing strategy? if not ((j<r<=i) or (i<j<r) or (r<=i<j)) then break; // yes find it from rehashing, swap i = (i + 1) % D; // no, cannot find it from rehashing } if (i!=j and not empty[i]) then { ht[j] = ht[i]; remove(i); } } Eager Deletion: fill holesEager Deletion: fill holes Remove and find replacement:  Fill in the hole for later searches
  • 26. 26 Eager Deletion Analysis (cont.)Eager Deletion Analysis (cont.)  If not full  After deletion, there will be at least two holes  Elements that are affected by the new hole are  Initial hashed location is cyclically before the new hole  Location after linear probing is in between the new hole and the next hole in the search order  Elements are movable to fill the hole Next hole in the search orderNew hole Initial hashed location Location after linear probing Next hole in the search order Initial hashed location
  • 27. 27 Eager Deletion Analysis (cont.)Eager Deletion Analysis (cont.) The important thing is to make sure that if a replacement (i) is swapped into deleted (j), we can still find that element. How can we not find it?  If the original hashed position (r) is circularly in between deleted and the replacement j r i j ri jr i i r Will not find i past the empty green slot! j i r i r Will find i
  • 28. 28 Quadratic ProbingQuadratic Probing Solves the clustering problem in Linear ProbingSolves the clustering problem in Linear Probing  Check H(x)  If collision occurs check H(x) + 1  If collision occurs check H(x) + 4  If collision occurs check H(x) + 9  If collision occurs check H(x) + 16  ...  H(x) + i2
  • 29. 29 Quadratic Probing (insert 12)Quadratic Probing (insert 12) 0 1 2 3 4 5 6 7 8 9 10 42 9 14 1 16 24 31 28 7 12 = 1 x 11 + 1 12 mod 11 = 1 0 1 2 3 4 5 6 7 8 9 10 42 9 14 1 16 24 31 28 7 12
  • 30. 30 Double HashingDouble Hashing When collision occurs use a second hash functionWhen collision occurs use a second hash function  Hash2 (x) = R – (x mod R)  R: greatest prime number smaller than table-size Inserting 12Inserting 12 H2(x) = 7 – (x mod 7) = 7 – (12 mod 7) = 2  Check H(x)  If collision occurs check H(x) + 2  If collision occurs check H(x) + 4  If collision occurs check H(x) + 6  If collision occurs check H(x) + 8  H(x) + i * H2(x)
  • 31. 31 Double Hashing (insert 12)Double Hashing (insert 12) 0 1 2 3 4 5 6 7 8 9 10 42 9 14 1 16 24 31 28 7 12 = 1 x 11 + 1 12 mod 11 = 1 7 –12 mod 7 = 2 0 1 2 3 4 5 6 7 8 9 10 42 9 14 1 16 24 31 28 7 12
  • 32. 32 RehashingRehashing If table gets too full, operations will take too long. Build another table, twice as big (and prime).  Next prime number after 11 x 2 is 23 Insert every element again to this table Rehash after a percentage of the table becomes full (70% for example)
  • 33. 33 Collision FunctionsCollision Functions Hi(x)= (H(x)+i) mod B  Linear pobing Hi(x)= (H(x)+c*i) mod B (c > 1)  Linear probing with step-size = c Hi(x)= (H(x)+i2 ) mod B  Quadratic probing Hi(x)= (H(x)+ i * H2(x)) mod B
  • 34. 34 Analysis of Open HashingAnalysis of Open Hashing Effort of one Insert?  Intuitively – that depends on how full the hash is Effort of an average Insert? Effort to fill the Bucket to a certain capacity?  Intuitively – accumulated efforts in inserts Effort to search an item (both successful and unsuccessful)? Effort to delete an item (both successful and unsuccessful)?  Same effort for successful search and delete?  Same effort for unsuccessful search and delete?
  • 35. 35 Issues:Issues: What do we lose?What do we lose?  Operations that require ordering are inefficient  FindMax: O(n) O(log n) Balanced binary tree  FindMin: O(n) O(log n) Balanced binary tree  PrintSorted: O(n log n) O(n) Balanced binary tree What do we gain?What do we gain?  Insert: O(1) O(log n) Balanced binary tree  Delete: O(1) O(log n) Balanced binary tree  Find: O(1) O(log n) Balanced binary tree How to handle Collision?How to handle Collision?  Separate chaining  Open addressing
  • 36. 36 Theory of HashingTheory of Hashing First the bad news.First the bad news. TheoremTheorem:: ForFor anyany hash function h: U -> {0, 1, …, M}, therehash function h: U -> {0, 1, …, M}, there exists a set S of n keys thatexists a set S of n keys that all map to the same locationall map to the same location,, assuming |U| > nM.assuming |U| > nM. So, in the worst-case no hash function can avoid linear searchSo, in the worst-case no hash function can avoid linear search complexity!complexity! Proof.Proof. Take any hash function h you wish to considerTake any hash function h you wish to consider Map all the keys of U using h to the table of size MMap all the keys of U using h to the table of size M By the pigeon-hole principle, at least one table entry will have nBy the pigeon-hole principle, at least one table entry will have n keys.keys. Choose those n keys as input set S.Choose those n keys as input set S. Now h will maps the entire set S to a single location, for worst-caseNow h will maps the entire set S to a single location, for worst-case example of hashingexample of hashing..
  • 37. 37 Theory of HashingTheory of Hashing The negative result says thatThe negative result says that given a fixed hash function hgiven a fixed hash function h,, one can always construct a set S that is bad for h.one can always construct a set S that is bad for h. However, what we desire is something different:However, what we desire is something different: We are not choosing S; it is our (given) input.We are not choosing S; it is our (given) input. Can we find a good h for this particular S?Can we find a good h for this particular S? Theory shows that a random choice of h works.Theory shows that a random choice of h works.
  • 38. 38 Theory of Hashing: Birthday ParadoxTheory of Hashing: Birthday Paradox To appreciate the subtlety of hashing, first consider aTo appreciate the subtlety of hashing, first consider a puzzle:puzzle: the birthday paradoxthe birthday paradox.. Suppose birth days are chance events:Suppose birth days are chance events: date of birth is purely randomdate of birth is purely random any day of the year just as likely as anotherany day of the year just as likely as another
  • 39. 39 Theory of Hashing: Birthday Paradox
What are the chances that, in a group of 30 people, at least two have the same birthday?
How many people are needed for at least a 50% chance of a shared birthday?
It is called a paradox because the answer appears counter-intuitive: there are 365 different birthdays, so for a 50% chance you might expect to need at least 182 people.
  • 40. 40 Birthday Paradox: the math
Suppose 2 people are in the room. What is the prob. that they share a birthday?
The answer is 1/365: all birthdays are equally likely, so B's birthday falls on A's birthday 1 time in 365.
Now suppose there are k people in the room. It is more convenient to calculate the prob. X that no two share a birthday; the answer we want is then 1 - X.
  • 41. 41 Birthday Paradox
Define P_i = prob. that the first i people all have distinct birthdays.
For convenience, define p = 1/365.
P_1 = 1
P_2 = (1 - p)
P_3 = (1 - p) * (1 - 2p)
P_k = (1 - p) * (1 - 2p) * ... * (1 - (k-1)p)
You can now verify that for k = 23, P_k <= 0.4999.
That is, with just 23 people in the room, there is a better than 50% chance that two share a birthday.
  • 42. 42 Birthday Paradox: derivation
Use 1 - x <= e^(-x), which holds for all x. Therefore 1 - j*p <= e^(-jp).
Also, e^x * e^y = e^(x+y).
Therefore P_k <= e^(-p - 2p - ... - (k-1)p) = e^(-k(k-1)p/2).
For k = 23, we have k(k-1)p/2 = 253/365, about 0.693, so P_23 <= e^(-253/365), about 0.4999.
Connection to hashing: suppose n = 23 keys and the hash table has size M = 365. Then there is a roughly 50% chance that 2 keys land in the same bucket.
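These numbers are easy to verify directly (a quick check in plain Python, using only the formulas above):

    import math

    p = 1 / 365
    P = 1.0
    for j in range(1, 23):        # P_23 = (1 - p)(1 - 2p)...(1 - 22p)
        P *= 1 - j * p
    print(P)                               # ~0.4927: better than even odds
    print(math.exp(-23 * 22 * p / 2))      # the bound: ~0.49998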
  • 43. 43 Theory of Hashing: Universal Hash Functions
A set of hash functions H is called universal if, for a hash function h chosen at random from it,
Prob[h(x) = h(y)] <= 1/M, for any distinct x, y in U.
Theorem. Suppose H is universal, S is an n-element subset of U, and h is a random hash function from H. Then for any x in S, the expected number of collisions with x is at most (n-1)/M.
  • 44. 44 Theory of Hashing: Universal Hash Functions
Theorem. Suppose H is universal, S is an n-element subset of U, and h is a random hash function from H. Then for any x in S, the expected number of collisions with x is at most (n-1)/M.
Proof. Consider any x in S. For any other y, the prob. that h(y) = h(x) is at most 1/M (by universality). By linearity of expectation, the expected number of keys mapping to h(x) is at most (n-1)/M.
Corollary. By using a random hash function (from a universal family), we get expected search time O(1 + n/M).
Universal hash functions exist; hashing modulo a prime is an example, though it is not proved here.
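For concreteness, a sketch of the classic mod-prime family (the particular prime, the key range, and the names are illustrative assumptions): fix a prime P larger than any key, then draw h(x) = ((a*x + b) mod P) mod M with random a, b.

    import random

    P = 2_147_483_647               # a prime larger than any expected key

    def draw_hash(M):
        # Return one random member h_{a,b} of the family.
        a = random.randint(1, P - 1)
        b = random.randint(0, P - 1)
        return lambda x: ((a * x + b) % P) % M

    h = draw_hash(11)
    # Over the random draw of h, Prob[h(x) == h(y)] is about 1/11 for x != y.
    print(h(42), h(1042))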
  • 45. 45 Constructing Universal Hash Functions
  • 46. 46 Universal Hash Functions by Dot Products
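The body of this slide is not in the transcript; the standard dot-product construction it presumably presents looks like the following sketch (table size, digit count, and names are all illustrative): view each key as R digits in base M, with M prime, and hash with a random dot product.

    import random

    M = 101                      # prime table size (illustrative)
    R = 4                        # each key is treated as R base-M digits

    a = [random.randrange(M) for _ in range(R)]   # random vector, drawn once

    def h(x):
        # Dot product of a with the base-M digits of x, taken mod M.
        total = 0
        for i in range(R):
            total += a[i] * (x % M)
            x //= M
        return total % M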
  • 48. 48 A Fact from Number Theory
  • 51. 51 Perfect Hashing: Worst-Case O(1) Lookup
Universal hashing assures us that hashing has expected O(1) search time, assuming n/M is at most a constant.
But what about the worst case? There remains a small, but non-zero, prob. of an unlucky random draw.
A more sophisticated theory of perfect hashing shows that one can achieve O(1) lookups even in the worst case, using a 2-level hash table.
Fredman-Komlos-Szemeredi [JACM 1984]
  • 52. 52 Perfect Hashing: Worst-Case O(1) Lookup
  • 53. 53 Collisions at Level 2
  • 54. 54 Achieving Zero Collisions at Level 2
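The bodies of slides 52-54 are not in the transcript; here is a hedged sketch of the FKS scheme they describe (all names and the hash family are assumptions): hash the static set S into n buckets, then give each bucket with n_i keys its own table of size n_i^2, redrawing that bucket's hash function until it has zero collisions.

    import random

    P = 2_147_483_647

    def draw_hash(M):
        a = random.randint(1, P - 1)
        b = random.randint(0, P - 1)
        return lambda x: ((a * x + b) % P) % M

    def build_fks(S):
        n = len(S)
        h1 = draw_hash(n)                     # level-1 hash into n buckets
        buckets = [[] for _ in range(n)]
        for x in S:
            buckets[h1(x)].append(x)
        # (A full version also redraws h1 if the sum of n_i^2 exceeds c*n,
        # which keeps total space O(n) in expectation.)
        level2 = []
        for bucket in buckets:
            size2 = len(bucket) ** 2          # quadratic space per bucket
            while True:                       # redraw until collision-free
                h2 = draw_hash(size2) if size2 else None
                table = [None] * size2
                ok = True
                for x in bucket:
                    j = h2(x)
                    if table[j] is not None:
                        ok = False            # collision: try another h2
                        break
                    table[j] = x
                if ok:
                    break
            level2.append((h2, table))
        return h1, level2

    def lookup(x, h1, level2):
        h2, table = level2[h1(x)]             # two hashes: worst-case O(1)
        return bool(table) and table[h2(x)] == x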
  • 55. 55 Analysis of Space Complexity
  • 56. 56 Bloom Filters
In some applications, we need a very compact data structure for a quick membership test: e.g., a table of weak passwords.
We are not interested in the passwords themselves, so there is no need to store the keys explicitly (as hash tables do).
Bloom filters are a highly space-efficient data structure for this kind of finger-printing.
In other words: how compact a table will suffice if we just want a quick test for "Is x in S?"
  • 57. 57 A Motivating Application
Web caching:
- An ISP keeps several levels of caches for fast access.
- Upon a client's request for data (an image, a movie, etc.):
  Check if the data is in the local cache. If so, serve it from the cache.
  Otherwise, fetch the data from the remote server.
- Remote server access is several orders of magnitude slower, so local access is hugely preferable.
- In fact, even if an occasional false positive occurs, the extra penalty of checking the local cache is negligible.
  • 58. 58 Bloom Filters vs. Hashing
Bloom filters sacrifice correctness for space efficiency:
- If the key is present, they always find it.
- But they may say Yes when the key is in fact not present: the false-positives problem.
They can also be thought of as an extension of hashing with an interesting space/error-rate tradeoff.
- Universal hashing gets its power from choosing the hash function at random: randomness as an aid to foil an adversarial choice of keys.
- Perfect hashing shows this can be achieved even in the worst case, but at the expense of added complexity.
- An alternative: apply multiple hash functions to each key. This allows the use of simple hash functions while minimizing the risk of relying on a single one.
  • 59. 59 Bloom Filter: formal setup
Store an n-element set S from a large universe U, with n = |S| << |U|.
- Think of U as all possible web pages, and S as the set maintained in the cache.
We want to support "membership queries": is a given element x currently in the set S?
- If the data structure returns No, then x is definitely not in S.
- The data structure can say Yes even if x is not in S, but only with small probability.
- Membership and Insert operations should take O(1) time.
- Delete can be handled as well.
  • 60. 60 Bloom Filters: Details
A Bloom filter is a bit vector B of m bits.
Each key is mapped into B using k independent hash functions; the number of hash functions k is an optimization parameter.
To insert x into S:
- Compute h1(x), h2(x), ..., hk(x)
- Set B[hi(x)] = 1, for i = 1, 2, ..., k.
To check for membership:
- Compute h1(x), h2(x), ..., hk(x)
- Answer Yes if B[hi(x)] = 1 for all i = 1, 2, ..., k; otherwise answer No.
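A minimal sketch of these two operations in Python (simulating the k independent hash functions by salting Python's built-in hash is an implementation shortcut, not something from the slides):

    import random

    class BloomFilter:
        def __init__(self, m, k):
            self.m = m
            self.bits = [0] * m
            self.salts = [random.getrandbits(32) for _ in range(k)]

        def _positions(self, x):
            # h_1(x), ..., h_k(x)
            return [hash((s, x)) % self.m for s in self.salts]

        def insert(self, x):
            for pos in self._positions(x):
                self.bits[pos] = 1

        def member(self, x):
            # No is always correct; Yes may be a false positive.
            return all(self.bits[pos] for pos in self._positions(x))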
  • 61. 61 Bloom Filters: an example
  • 62. 62 Bloom Filters: analysis
  • 63. 63 Bloom Filters: analysis
- The prob. that any given bit is still unset (0) is p, approx. e^(-kn/m).
- A non-member y gets flagged as present exactly when all k of its hash bits are set to 1, which happens with prob.
  (1 - p)^k, i.e. approx. (1 - e^(-kn/m))^k.
  • 64. 64 Bloom Filters: analysis
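The body of slide 64 is not in the transcript; presumably it optimizes over k. For fixed n and m, the false-positive rate above is minimized near k = (m/n) * ln 2, which is easy to check numerically (the example numbers are illustrative):

    import math

    def false_positive_rate(n, m, k):
        return (1 - math.exp(-k * n / m)) ** k

    n, m = 1000, 10000                       # 10 bits per stored key
    k_opt = round((m / n) * math.log(2))     # = 7 here
    print(k_opt, false_positive_rate(n, m, k_opt))   # ~0.0082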
  • 65. 65 Bloom Filters vs. Hashing
- Bloom filters use multiple hash functions, creating a k-bit finger-print for each input key.
- If we store an n-key set in a table of m bits, the BF analysis tells us the optimal choice of k and the resulting error rate.
- Why is this better than a simple hash table of size m? Let's compare.
- A hash table gives a false positive whenever a collision occurs, and the prob. of a collision is 1 - (1 - 1/m)^n, which is approx. 1 - e^(-n/m).
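Plugging in the same illustrative numbers makes the gap concrete (a quick check, not from the slides):

    import math

    n, m = 1000, 10000
    hash_fp = 1 - math.exp(-n / m)               # ~0.095 for a single hash
    bloom_fp = (1 - math.exp(-7 * n / m)) ** 7   # ~0.008 with k = 7
    print(hash_fp, bloom_fp)  # same space, roughly 10x fewer false positives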
  • 66. 66 Bloom Filter vs. Hash Tables