CSE 326: Data Structures
Lecture #14
Bart Niswonger
Summer Quarter 2001
Today’s Outline
• Project
– Rules of competition
– Comments?
• Hashing
– Probing continued
– Rehashing
– Extendible Hashing
– Case Study
Cost of a Database Query
I/O to CPU ratio is 300!
Extendible Hashing
• Hashing technique for huge data sets
– optimizes to reduce disk accesses
– each hash bucket fits on one disk block
– better than B-Trees if order is not important
• Table contains
– buckets, each fitting in one disk block, with the
data
– a directory that fits in one disk block used to hash
to the correct bucket
Extendible Hash Table
• Directory - entries labeled by k bits & pointer
to bucket with all keys starting with its bits
• Each block contains keys & data matching on
the first j ≤ k bits
directory for k = 3 (entries 000 001 010 011 100 101 110 111):
– prefix 00 (j = 2): 00001, 00011, 00100, 00110
– prefix 01 (j = 2): 01001, 01011, 01100
– prefix 100 (j = 3): 10001, 10011
– prefix 101 (j = 3): 10101, 10110, 10111
– prefix 11 (j = 2): 11001, 11011, 11100, 11110
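A minimal in-memory sketch of this directory-plus-buckets layout (not from the lecture; the names Bucket, ExtendibleHashTable, findBucket, and BUCKET_CAPACITY are illustrative, and indexing by the leading globalDepth bits just follows the slide's habit of labeling buckets by their leading bits):

// Illustrative sketch of an extendible hash table directory and its buckets.
// In a real system each Bucket occupies one disk block; here everything is
// in memory. Names and the capacity are hypothetical, not the lecture's code.
#include <cstddef>
#include <cstdint>
#include <vector>

const std::size_t BUCKET_CAPACITY = 4;        // keys per "disk block" (tiny, to match the example)

struct Bucket {
    int localDepth;                           // j: leading bits shared by every key in this bucket
    std::vector<std::uint64_t> keys;          // hashed keys; a real bucket also holds the data
};

struct ExtendibleHashTable {
    int globalDepth;                          // k: leading bits used to index the directory
    std::vector<Bucket*> directory;           // 2^k pointers; several entries may share a bucket

    ExtendibleHashTable() : globalDepth(1), directory(2) {
        directory[0] = new Bucket{1, {}};     // bucket for keys starting with 0
        directory[1] = new Bucket{1, {}};     // bucket for keys starting with 1
    }

    std::size_t dirIndex(std::uint64_t hashedKey) const {
        return hashedKey >> (64 - globalDepth);   // top globalDepth bits select the entry
    }
    Bucket* findBucket(std::uint64_t hashedKey) const {
        return directory[dirIndex(hashedKey)];
    }
};

A find then costs two disk accesses: one for the directory block and one for the bucket block.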
Inserting (easy case)
insert(11011): the directory entry for 110 points to the j = 2 bucket holding 11001, 11100, 11110. There is room, so 11011 is simply added, giving 11001, 11011, 11100, 11110; no other bucket and no directory entry changes. Like a B-Tree insert with room, this takes two disk accesses: one for the directory, one for the bucket.
Splitting
insert(11000): the j = 2 bucket for prefix 11 already holds 11001, 11011, 11100, 11110 and is full, so it splits on one more bit. The new j = 3 bucket for prefix 110 gets 11000, 11001, 11011; the j = 3 bucket for prefix 111 gets 11100, 11110. The directory (still k = 3) already has separate 110 and 111 entries, so only those pointers change. As in B-Trees, this takes three disk accesses: the directory, the old bucket, and the new bucket.
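Continuing the hypothetical sketch from earlier (same Bucket and ExtendibleHashTable types), the split on this slide, where the full bucket's local depth is still below the directory's global depth, might look roughly like this:

// Split a full bucket by one more bit. Because localDepth < globalDepth here,
// the directory already has separate entries for the two halves, so only its
// pointers change. (Continues the illustrative sketch; not the lecture's code.)
void splitBucket(ExtendibleHashTable& table, Bucket* full) {
    int newDepth = full->localDepth + 1;
    Bucket* zero = new Bucket{newDepth, {}};          // keys whose newDepth-th bit is 0
    Bucket* one  = new Bucket{newDepth, {}};          // keys whose newDepth-th bit is 1
    for (std::uint64_t k : full->keys) {
        bool bit = (k >> (64 - newDepth)) & 1;
        (bit ? one : zero)->keys.push_back(k);
    }
    // Re-aim every directory entry that pointed at the old bucket.
    for (std::size_t i = 0; i < table.directory.size(); ++i) {
        if (table.directory[i] == full) {
            bool bit = (i >> (table.globalDepth - newDepth)) & 1;
            table.directory[i] = bit ? one : zero;
        }
    }
    delete full;
}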
Rehashing
Directory with k = 2 (entries 00, 01, 10, 11); the buckets shown (each j = 2) hold 01101; 10000, 10001, 10011, 10111; and 11001, 11110.
insert(10010): the bucket for prefix 10 is full: no room to insert and no adoption! Its local depth already equals the directory's global depth, so we must expand (double) the directory to k = 3.
Now, it’s just a normal split.
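Continuing the same hypothetical sketch, expanding the directory just duplicates every entry (raising k by one), after which the split proceeds as on the previous slide; a rough version of the full insert:

// Double the directory: each old entry becomes two entries that both point at
// the old bucket, and globalDepth grows by one. (Illustrative sketch only.)
void doubleDirectory(ExtendibleHashTable& table) {
    std::vector<Bucket*> bigger(2 * table.directory.size());
    for (std::size_t i = 0; i < table.directory.size(); ++i) {
        bigger[2 * i]     = table.directory[i];
        bigger[2 * i + 1] = table.directory[i];
    }
    table.directory = bigger;
    ++table.globalDepth;
}

void insert(ExtendibleHashTable& table, std::uint64_t hashedKey) {
    Bucket* b = table.findBucket(hashedKey);
    while (b->keys.size() >= BUCKET_CAPACITY) {
        if (b->localDepth == table.globalDepth)
            doubleDirectory(table);               // expand; now it's just a normal split
        splitBucket(table, b);
        b = table.findBucket(hashedKey);
    }
    b->keys.push_back(hashedKey);                 // the easy case: room in the bucket
}

The remaining danger, as the next slide discusses, is the directory itself growing too large to fit in a block.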
When Directory Gets Too Large
• Store only pointers to the items
+ (potentially) much smaller M
+ fewer items in the directory
– one extra disk access!
• Rehash
+ potentially better distribution over the buckets
+ fewer unnecessary items in the directory
– can’t solve the problem if there’s simply too much
data
• What if these don’t work?
– use a B-Tree to store the directory!
Rehash of Hashing
• Hashing is a great data structure for storing
unordered data that supports insert, delete & find
• Both separate chaining (open) and open addressing
(closed) hashing are useful
– separate chaining flexible
– closed hashing uses less storage, but performs badly with
load factors near 1
– extendible hashing for very large disk-based data
• Hashing pros and cons
+ very fast
+ simple to implement, supports insert, delete, find
- lazy deletion necessary in open addressing, can waste
storage
- does not support operations dependent on order: min, max,
range
Case Study
• Spelling dictionary
– 30,000 words
– static
– arbitrary(ish)
preprocessing time
• Goals
– fast spell checking
– minimal storage
• Practical notes
– almost all searches are
successful
– words average about 8
characters in length
– 30,000 words at 8
bytes/word is 1/4 MB
– pointers are 4 bytes
– there are many
regularities in the
structure of English
words
Why?
Solutions
• Solutions
– sorted array + binary search
– open hashing
– closed hashing + linear probing
What kind of hash
function should we
use?
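Since preprocessing time is essentially free, one reasonable choice (the lecture does not prescribe a specific function) is a parameterized polynomial string hash: if the bucket distribution turns out poorly, pick a new multiplier and rehash during preprocessing. A hypothetical sketch:

// Parameterized string hash; the multiplier and table size are illustrative.
#include <cstddef>
#include <string>

std::size_t stringHash(const std::string& word, std::size_t multiplier, std::size_t tableSize) {
    std::size_t h = 0;
    for (char c : word) {
        h = h * multiplier + static_cast<unsigned char>(c);   // polynomial accumulation
    }
    return h % tableSize;
}

// e.g. stringHash("hello", 37, 65537); try another multiplier if the buckets cluster.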
Storage
• Assume words are strings & entries are
pointers to strings
– array + binary search
– open hashing
– closed hashing (open addressing)
How many pointers does each use?
Analysis
Binary search
– storage: n pointers + words = 360KB
– time: log2n ≤ 15 probes/access, worst case
Open hashing
– storage: n + n/λ pointers + words (λ = 1 ⇒ 600KB)
– time: 1 + λ/2 probes/access, average (λ = 1 ⇒ 1.5)
Closed hashing
– storage: n/λ pointers + words (λ = 0.5 ⇒ 480KB)
– time: (1/2)(1 + 1/(1 − λ)) probes/access, average (λ = 0.5 ⇒ 1.5)
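For concreteness, a worked instance of the closed-hashing estimates (using the linear-probing successful-search formula above and the 30,000-word, 4-byte-pointer, 8-byte-word figures from the case study):

\[
\tfrac{1}{2}\Bigl(1 + \tfrac{1}{1-\lambda}\Bigr)\Big|_{\lambda = 0.5}
  = \tfrac{1}{2}(1 + 2) = 1.5 \ \text{probes/access},
\qquad
\frac{n}{\lambda} = \frac{30{,}000}{0.5} = 60{,}000 \ \text{pointers} = 240\,\text{KB},
\quad 240\,\text{KB} + 240\,\text{KB of words} = 480\,\text{KB}.
\]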
Which one should we use?
To Do
• Start Project III (only 1 week!)
• Start reading chapter 8 in the book
Coming Up
• Disjoint-set union-find ADT
• Quiz (Thursday)


Editor's Notes

  • #3 A whole lotta hashing going on! We’ll get to the hashing, but first the project stuff: new assignment, communication. Rules for maze runners? Comments?
  • #4 This is a trace of how much time a simple database operation takes: this one lists all employees along with their job information, getting the employee and job information from separate databases. The thing to notice is that disk access takes something like 100 times as much time as processing. I told you disk access was expensive! BTW: the “index” in the picture is a B-Tree. - SQLServer
  • #5 OK, extendible hashing takes hashing and makes it work for huge data sets. The idea is very similar to B-Trees.
  • #6 Notice that those are the hashed keys I’m representing in the table. Actually, the buckets would store the original key/value pair.
  • #7 Inserting if there’s room in the bucket is easy. Just like in B-Trees. This takes 2 disk accesses. One to grab the directory, and one for the bucket. But what if there isn’t room?
  • #8 Then, as in B-Trees, we need to split a node. In this case, that turns out to be easy because we have room in the directory for another node. How many disk accesses does this take? 3: one for the directory, one for the old bucket, and one for the new bucket.
  • #9 Now, even though there’s tons of space in the table, we have to expand the directory. Once we’ve done that, we’re back to the case on the previous page. How many disk accesses does this take? 3! Just as for a normal split. The only problem is if the directory ends up too large to fit in a block. So, what does this say about the hash function we want for extendible hashing? We want something parameterized so we can rehash if necessary to get the buckets evened out.
  • #10 If the directory gets too large, we’re in trouble. In that case, we have two ways to fix it. One puts more keys in each bucket. The other evens out the buckets. Neither is guaranteed to work.
  • #12 Alright, let’s move on to the case study. Here’s the situation. Why are most searches successful? Because most words will be spelled correctly.
  • #13 Remember that we get arbitrary preprocessing time. What can we do with that? We should use a universal hash function (or at least a parameterized hash function) because that will let us rehash during preprocessing until we’re happy with the distribution.
  • #14 Notice that all of these use the same amount of memory for the strings. How many pointers does each one use? Array: n, just a pointer to each string. Open hashing: n (a pointer to each string) + n/lambda (a null pointer ending each list); note, I’m playing a little game here; without an optimization, this would actually be 2n + n/lambda. Closed hashing: n/lambda.
  • #15 The answer, of course, is it depends! However, given that we want fast checking, either hash table is better than the BST. And, given that we want low memory usage, we should use the closed hash table. Can anyone argue for the binary search?
  • #16 Even if you don’t yet have a team you should absolutely start thinking about project III. The code should not be hard to write, I think, but the algorithm may take you a bit
  • #17 Next time, we’ll look at extendible hashing for huge data sets.