2. Today’s Outline
• Project
– Rules of competition
– Comments?
• Hashing
– Probing continued
– Rehashing
– Extendible Hashing
– Case Study
3. Cost of a Database Query
I/O to CPU ratio is 300!
4. Extendible Hashing
• Hashing technique for huge data sets
– optimizes to reduce disk accesses
– each hash bucket fits on one disk block
– better than B-Trees if order is not important
• Table contains
– buckets, each fitting in one disk block, with the data
– a directory that fits in one disk block, used to hash to the correct bucket
5. Extendible Hash Table
• Directory - entries labeled by k bits & a pointer to the bucket with all keys starting with its bits
• Each block contains keys & data matching on the first j ≤ k bits
directory for k = 3: 000 001 010 011 100 101 110 111
buckets (local depth in parentheses):
(2) keys starting 00: 00001, 00011, 00100, 00110
(2) keys starting 01: 01001, 01011, 01100
(3) keys starting 100: 10001, 10011
(3) keys starting 101: 10101, 10110, 10111
(2) keys starting 11: 11001, 11011, 11100, 11110
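To make the directory-and-bucket picture concrete, here is a minimal sketch of a find against a k-bit directory. It is illustrative only (the class and field names are mine, not the course's), and it assumes hashed keys are fixed-width bit strings like those in the figure.

```python
# Sketch of an extendible hash table: one directory block plus one block per bucket.
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth   # j: keys here agree on their first j bits
        self.items = {}                  # hashed key (bit string) -> value, all in one block

class ExtendibleHashTable:
    def __init__(self, global_depth=3):
        self.global_depth = global_depth                    # k: directory labels are k-bit prefixes
        self.directory = [Bucket(0)] * (2 ** global_depth)  # initially every label shares one bucket

    def find(self, hashed_key):
        # hashed_key is a fixed-width bit string, e.g. "10110".
        index = int(hashed_key[:self.global_depth], 2)      # first k bits pick the directory entry
        bucket = self.directory[index]                      # disk read 1: the directory block
        return bucket.items.get(hashed_key)                 # disk read 2: the bucket block
```

A successful find reads the directory block and then one bucket block, matching the two disk accesses described in the notes below.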
9. When Directory Gets Too Large
• Store only pointers to the items
+ (potentially) much smaller M
+ fewer items in the directory
– one extra disk access!
• Rehash
+ potentially better distribution over the buckets
+ fewer unnecessary items in the directory
– can’t solve the problem if there’s simply too much data
• What if these don’t work?
– use a B-Tree to store the directory!
10. Rehash of Hashing
• Hashing is a great data structure for storing unordered data; it supports insert, delete & find
• Both separate chaining (open) and open addressing (closed) hashing are useful
– separate chaining is flexible
– closed hashing uses less storage, but performs badly with load factors near 1
– extendible hashing for very large disk-based data
• Hashing pros and cons
+ very fast
+ simple to implement, supports insert, delete, find
- lazy deletion necessary in open addressing, can waste storage (sketch below)
- does not support operations dependent on order: min, max, range
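The lazy-deletion point above refers to tombstones in open addressing. A rough sketch of the idea (names and capacity handling are mine; it assumes the load factor stays well below 1):

```python
# Closed hashing (open addressing) with linear probing and lazy deletion.
EMPTY, TOMBSTONE = object(), object()

class LinearProbingTable:
    def __init__(self, size=11):
        self.slots = [EMPTY] * size

    def _probe(self, key):
        # Walk the probe sequence until we hit the key or a truly empty slot.
        i = hash(key) % len(self.slots)
        while self.slots[i] is not EMPTY:
            slot = self.slots[i]
            if slot is not TOMBSTONE and slot[0] == key:
                return i
            i = (i + 1) % len(self.slots)
        return i

    def insert(self, key, value):
        # A fuller version would reuse tombstoned slots; this one just finds an empty slot.
        self.slots[self._probe(key)] = (key, value)

    def find(self, key):
        slot = self.slots[self._probe(key)]
        return None if slot is EMPTY else slot[1]

    def delete(self, key):
        i = self._probe(key)
        if self.slots[i] is not EMPTY:
            self.slots[i] = TOMBSTONE   # lazy delete: marker keeps later probe sequences intact
```

The tombstones are what wastes storage: slots stay occupied until the table is rehashed.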
11. Case Study
• Spelling dictionary
– 30,000 words
– static
– arbitrary(ish) preprocessing time
• Goals
– fast spell checking
– minimal storage
• Practical notes
– almost all searches are successful
– words average about 8 characters in length
– 30,000 words at 8 bytes/word is 1/4 MB
– pointers are 4 bytes
– there are many regularities in the structure of English words
Why?
12. Solutions
• Solutions
– sorted array + binary search
– open hashing
– closed hashing + linear probing
What kind of hash function should we use?
13. Storage
• Assume words are strings & entries are pointers to strings
[Figure: candidate storage layouts – array + binary search, open hashing, closed hashing (open addressing)]
How many pointers does each use?
14. Analysis
Binary search
– storage: n pointers + words = 360KB
– time: log2n ≤ 15 probes/access, worst case
Open hashing
– storage: n + n/λ pointers + words (λ = 1 ⇒ 600KB)
– time: 1 + λ/2 probes/access, average (λ = 1 ⇒ 1.5)
Closed hashing
– storage: n/λ pointers + words (λ = 0.5 ⇒ 480KB)
– time: ½(1 + 1/(1 − λ)) probes/access, average (λ = 0.5 ⇒ 1.5)
Which one should we use?
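As a sanity check on the slide's figures, here is the back-of-the-envelope arithmetic, assuming n = 30,000 words, 4-byte pointers, 8 bytes per word, and counting the unoptimized 2n + n/λ pointers for separate chaining that the speaker notes mention:

```python
# Quick verification of the storage and probe numbers on the Analysis slide.
n = 30_000          # dictionary words
ptr = 4             # bytes per pointer
word_bytes = 8      # average bytes per word
words = n * word_bytes                               # ~240 KB of string data, shared by all three

# Binary search over a sorted array: one pointer per word.
binary_search = n * ptr + words                      # 360,000 bytes ~ 360 KB; log2(30000) ~ 15 probes

# Open hashing (separate chaining), lambda = 1: table slots plus list nodes,
# each node holding a string pointer and a next pointer (2n + n/lambda pointers).
lam = 1.0
open_hashing = int((2 * n + n / lam) * ptr) + words  # 600,000 bytes ~ 600 KB
open_probes = 1 + lam / 2                            # 1.5 probes per successful find, on average

# Closed hashing (open addressing), lambda = 0.5: n/lambda slots, one pointer each.
lam = 0.5
closed_hashing = int((n / lam) * ptr) + words        # 480,000 bytes ~ 480 KB
closed_probes = 0.5 * (1 + 1 / (1 - lam))            # 1.5 probes, on average (linear probing)
```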
15. To Do
• Start Project III (only 1 week!)
• Start reading chapter 8 in the book
A whole lotta hashing going on!
We’ll get to the hashing – but first the project stuff – new assignment – communication
Rules for maze runners?
Comments?
This is a trace of how much time a simple database operation takes: this one lists all employees along with their job information, getting the employee and job information from separate databases.
The thing to notice is that disk access takes something like 100 times as much time as processing. I told you disk access was expensive!
BTW: the “index” in the picture is a B-Tree. - SQLServer
OK, extendible hashing takes hashing and makes it work for huge data sets.
The idea is very similar to B-Trees.
Notice that those are the hashed keys I’m representing in the table.
Actually, the buckets would store the original key/value pair.
Inserting if there’s room in the bucket is easy. Just like in B-Trees.
This takes 2 disk accesses. One to grab the directory, and one for the bucket.
But what if there isn’t room?
Then, as in B-Trees, we need to split a node.
In this case, that turns out to be easy because we have room in the directory for another node.
How many disk accesses does this take?
3: one for the directory, one for the old bucket, and one for the new bucket.
Now, even though there’s tons of space in the table, we have to expand the directory. Once we’ve done that, we’re back to the case on the previous page.
How many disk accesses does this take?
3! Just as for a normal split.
The only problem is if the directory ends up too large to fit in a block.
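Continuing the earlier sketch, here is a rough version of the insert-and-split step just described. It is illustrative only: BUCKET_CAPACITY stands in for "how many items fit in one block", and it assumes hashed keys are bit strings at least as wide as the directory's depth.

```python
BUCKET_CAPACITY = 4   # pretend one disk block holds four key/value pairs

def insert(self, hashed_key, value):
    index = int(hashed_key[:self.global_depth], 2)
    bucket = self.directory[index]
    if len(bucket.items) < BUCKET_CAPACITY or hashed_key in bucket.items:
        bucket.items[hashed_key] = value              # room: 2 disk accesses total
        return
    if bucket.local_depth == self.global_depth:       # no spare directory bit left:
        self.directory = [b for b in self.directory for _ in (0, 1)]
        self.global_depth += 1                        # double the directory (relabel with k+1 bits)
    # Split the full bucket: 3 disk accesses (directory, old bucket, new bucket).
    bucket.local_depth += 1
    new_bucket = Bucket(bucket.local_depth)
    for i, b in enumerate(self.directory):
        # Entries sharing the old prefix but with a 1 in the new bit point at the new bucket.
        if b is bucket and (i >> (self.global_depth - bucket.local_depth)) & 1:
            self.directory[i] = new_bucket
    old_items, bucket.items = bucket.items, {}
    for k, v in old_items.items():                    # redistribute on the new bit
        (new_bucket if k[bucket.local_depth - 1] == '1' else bucket).items[k] = v
    self.insert(hashed_key, value)                    # retry now that there should be room

ExtendibleHashTable.insert = insert                   # attach to the class sketched after slide 5
```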
So, what does this say about the hash function we want for extendible hashing?
We want something parameterized so we can rehash if necessary to get the buckets evened out.
If the directory gets too large, we’re in trouble. In that case, we have two ways to fix it.
One puts more keys in each bucket.
The other evens out the buckets.
Neither is guaranteed to work.
Alright, let’s move on to the case study.
Here’s the situation.
Why are most searches successful?
Because most words will be spelled correctly.
Remember that we get arbitrary preprocessing time. What can we do with that?
We should use a universal hash function (or at least a parameterized hash function) because that will let us rehash during preprocessing until we’re happy with the distribution.
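One way to get such a parameterized hash (a sketch; the random-multiplier polynomial scheme is my choice, not necessarily the one used in the course): draw a fresh multiplier each time, and during preprocessing keep redrawing until the buckets look even.

```python
import random

def make_string_hash(table_size, seed=None):
    """Return a polynomial string hash with a randomly chosen multiplier."""
    rng = random.Random(seed)
    multiplier = rng.randrange(37, 1 << 30) | 1        # random odd multiplier parameterizes the hash
    def h(word):
        value = 0
        for ch in word:
            value = (value * multiplier + ord(ch)) % table_size
        return value
    return h

def pick_good_hash(words, table_size, max_bucket=4, tries=100):
    # Preprocessing loop: keep the hash function whose worst bucket is smallest.
    best = None
    for _ in range(tries):
        h = make_string_hash(table_size)
        counts = [0] * table_size
        for w in words:
            counts[h(w)] += 1
        if best is None or max(counts) < best[0]:
            best = (max(counts), h)
        if best[0] <= max_bucket:
            break
    return best[1]
```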
Notice that all of these use the same amount of memory for the strings.
How many pointers does each one use?
n for the array: just store a pointer to each string.
n (a pointer to each string) + n/λ (a null pointer ending each list). Note, I’m playing a little game here; without an optimization, this would actually be 2n + n/λ.
n/λ for closed hashing.
The answer, of course, is it depends!
However, given that we want fast checking, either hash table is better than binary search.
And, given that we want low memory usage, we should use the closed hash table.
Can anyone argue for the binary search?
Even if you don’t yet have a team, you should absolutely start thinking about Project III.
The code should not be hard to write, I think, but the algorithm may take you a bit.
Next time, we’ll look at extendible hashing for huge data sets.