2. Sequential Search
Looks for the target from the first to the last
element of the list
The later in the list the target occurs the longer
it takes to find it
Does not assume anything about the order of the
elements in the list, so it can be used with an
unsorted list
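The scan described above is a straightforward loop; this is a minimal sketch (Python, not code from the slides) that returns the index of the target, or -1 on failure:

```python
def sequential_search(items, target):
    """Scan from the first element to the last; no ordering is assumed."""
    for i, value in enumerate(items):
        if value == target:
            return i          # found: i+1 comparisons were made
    return -1                 # all N comparisons made, target absent
```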
5. Worst-Case Analysis
If the target is in the last location, we look at all
of the elements to find it
If the target is not in the list, we need to look at
all of the elements to learn that
Therefore, the largest number of comparisons
we will do in this algorithm is N
6. Average-Case Analysis
If the search is always successful, there are N
places the target could be found
It will take 1 comparison to find the target in the
first location, 2 comparisons to find the target in
the second location, and so on
If each location is equally likely, we get:
A(N) = (1/N) · Σ(i=1 to N) i = (N + 1)/2
7. Average-Case Analysis
If the search can fail, there are N places the target
could be found and 1 possibility when it’s not
found
If the target is not found, we do N comparisons
If each of these N+1 possibilities is equally
likely, we get:
A(N) = (1/(N+1)) · ( Σ(i=1 to N) i + N ) = N/2 + N/(N+1) ≈ (N + 2)/2
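Both averages can be checked by brute force; a small sketch (the function names are mine, not the slides'):

```python
def avg_success(n):
    # i comparisons are needed to find the target in location i (1-based)
    return sum(range(1, n + 1)) / n

def avg_with_failure(n):
    # the failure case adds one more possibility costing n comparisons
    return (sum(range(1, n + 1)) + n) / (n + 1)
```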
8. Binary Search
Used with a sorted list
First check the middle list element
If the target matches the middle element, we are
done
If the target is less than the middle element, the
key must be in the first half
If the target is larger than the middle element,
the key must be in the second half
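The halving step above can be sketched as follows (a minimal Python version, not the slides' code):

```python
def binary_search(sorted_items, target):
    """Repeatedly check the middle element of the remaining range."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            return mid                 # target matches the middle element
        elif target < sorted_items[mid]:
            high = mid - 1             # target must be in the first half
        else:
            low = mid + 1              # target must be in the second half
    return -1
```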
10. Algorithm Review
Each comparison eliminates about half of the
elements of the list from consideration
If we begin with N = 2^k − 1 elements in the list,
there will be 2^(k−1) − 1 elements on the second
pass, and 2^(k−2) − 1 elements on the third pass
11. Worst-Case Analysis
In the worst case, we will either find the target
on the last pass, or not find the target at all
The last pass will have only one element left to
compare, which happens when 2^1 − 1 = 1
If N = 2^k − 1, then there must be
k = lg(N+1) passes
12. Average-Case Analysis
If the search is always successful, there are N
places the target could be found
There is one place we check on the first pass,
two places we could check on the second pass,
and four places we could check on the third pass
14. Average-Case Analysis
In looking at the binary tree, we see that there
are i comparisons needed to find the 2^(i−1)
elements on level i of the tree
For a list with N = 2^k − 1 elements, there are k
levels in the binary tree
These two facts give us:
A(N) = (1/N) · Σ(i=1 to k) i·2^(i−1) ≈ lg(N + 1) − 1
15. Average-Case Analysis
If the search can fail sometimes, there are N
places the target could be found and N+1
possibilities when it is not found
In other words, if the missing key were added to
the list, it could be put at the beginning, between
any two elements, or at the end – a total of N+1
different places
16. Average-Case Analysis
The possibilities when the key is found are still
the same as before, and the new cases all take k
comparisons when N = 2^k − 1
This gives us:
A(N) = (1/(2N+1)) · ( Σ(i=1 to k) i·2^(i−1) + k·(N+1) ) ≈ lg(N + 1) − 1/2
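The successful-search average can be verified empirically by counting loop passes; a sketch (all names are mine) for a perfect tree with N = 2^10 − 1 elements, where the average should land near lg(N+1) − 1 = 9:

```python
def passes_to_find(a, target):
    """Count the passes a textbook binary search makes to find target."""
    low, high, passes = 0, len(a) - 1, 0
    while low <= high:
        passes += 1
        mid = (low + high) // 2
        if a[mid] == target:
            return passes
        elif target < a[mid]:
            high = mid - 1
        else:
            low = mid + 1
    return passes

N = 2 ** 10 - 1                       # 1023 elements, k = 10 levels
a = list(range(N))
avg = sum(passes_to_find(a, x) for x in a) / N
```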
17. Any Alternative to Binary Search?
Have we used all the knowledge we have about
finding an item in an ordered array? The answer is
maybe not.
If you were looking for Mr. Alfred Aaron in the
telephone book, would you open the book in the
middle and see whether Aaron was in the first half
or second half of the book? I think not.
18. Any Alternative to Binary Search?
Given the additional information of the upper and
lower limits of the values in a list we can
improve on a binary search by estimating the
most likely position of an element in the list.
This is called an interpolation search.
19. Interpolation Search
It proceeds like a binary search only the list
is divided each time according to our
estimate of where the key is situated.
Given a uniform distribution of keys,
interpolation search has an average case time
complexity of only lg(lg n).
20. Interpolation Search
There is another type of information we
normally use when searching a phone book
which is not used by binary search but it is used
by interpolation search:
where would you open the phone book if
you were looking for Mr. Alfred Aaron?
21. Interpolation Search
If the following conditions are true then interpolation
search may be better than binary search:
Each access is very expensive compared to a typical instruction,
e.g. the array is stored on a disk and each comparison requires a
disk access.
The data are not only sorted but also fairly uniformly
distributed, e.g. a phone book is fairly uniformly distributed, an
input like: [1,2,3,4,5,6,7,8,16,32,355,...] is not.
22. Interpolation Search
In this situation we are willing to
spend more time to make an accurate
guess where the item may be (instead
of always picking the mid point):
23. Interpolation Search
For example:
Array of 1000 items
The lowest item in the range is 1000
The highest item in range is 1,000,000
We are looking for the item of value 12,000
Then we expect to find the item around the 12th
position (always under the assumption that the items
are uniformly distributed). This is expressed by the
formula:
next = low + (key − a[low]) × (high − low) / (a[high] − a[low])
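The estimate above can be sketched as a search routine (Python, my names; the array below mirrors the slide's example of 1000-ish uniformly spaced items between 1000 and 1,000,000):

```python
def interpolation_search(a, key):
    """Like binary search, but the split point is an interpolation
    estimate of where the key is likely to sit."""
    low, high = 0, len(a) - 1
    while low <= high and a[low] <= key <= a[high]:
        if a[high] == a[low]:                      # avoid division by zero
            return low if a[low] == key else -1
        # estimate the likely position by linear interpolation
        pos = low + (key - a[low]) * (high - low) // (a[high] - a[low])
        if a[pos] == key:
            return pos
        elif a[pos] < key:
            low = pos + 1
        else:
            high = pos - 1
    return -1

# uniformly spaced keys from 1000 to 1,000,000 (1001 items)
a = [1000 + i * 999 for i in range(1001)]
```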
25. Interpolation Search
Calculation is more costly than the binary search
calculation
It needs to be done using floating point operations.
One iteration may be slower than the complete binary
search.
If the cost of this calculation is insignificant compared to the
cost of accessing an item, we only care about the number of
iterations.
26. Interpolation Search
In the worst case, when the numbers are not uniformly
distributed, the running time could be linear and all the
items might be examined.
If the items are reasonably uniformly distributed, the
running time has been demonstrated to be O(log log N)
(apply the logarithm twice in succession).
For example, for N = 4 billion, log N is about 32 and
log log N is roughly 5.
28. Hash Tables
Hash tables are a common approach to the
storing/searching problem.
29. What is a Hash Table ?
The simplest kind of hash table
is an array of records.
This example has 701 records.
[Figure: an array of records, indices [0] through [700]]
30. What is a Hash Table ?
Each record has a special
field, called its key.
In this example, the key is a
long integer field called
Number.
[Figure: the record at index [4] has key Number = 506643548]
31. What is a Hash Table ?
The number might be a
person's identification
number, and the rest of the
record has information about
the person.
32. What is a Hash Table ?
When a hash table is in use,
some spots contain valid
records, and other spots are
"empty".
[Figure: valid records at a few indices (Numbers 506643548, 233667136, 281942902, 155778322); the other spots are empty]
33. Inserting a New Record
In order to insert a new record,
the key must somehow be
converted to an array index.
The index is called the hash
value of the key.
[Figure: a new record with Number = 580625685 waiting to be inserted]
34. Inserting a New Record
Typical way to create a hash
value:
(Number mod 701)
What is (580625685 mod 701) ?
35. Inserting a New Record
Typical way to create a hash
value: (Number mod 701)
(580625685 mod 701) = 3
36. Inserting a New Record
The hash value is used for the
location of the new record.
[Figure: the new record Number 580625685 is placed at index [3]]
38. Collisions
Here is another new record to
insert, with a hash value of 2.
[Figure: new record Number 701466868 arrives, announcing "My hash value is [2]."]
39. Collisions
This is called a collision,
because there is already
another valid record at [2].
[Figure: the table with a valid record already at index [2]]
When a collision occurs,
move forward until you
find an empty spot.
42. Collisions
This is called a collision,
because there is already
another valid record at [2].
[Figure: probing forward from [2]: [3] is occupied, so [4] is the first empty spot]
The new record goes
in the empty spot.
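The insertion rule above (hash, then move forward past occupied spots) can be sketched like this; a minimal Python version using the slides' 701-slot table, not the slides' own code:

```python
TABLE_SIZE = 701                         # matches the slides' example

def insert(table, record):
    """Place record at its hash value; on a collision, move forward
    until an empty spot (None) is found, wrapping past the last slot."""
    i = record["Number"] % TABLE_SIZE    # the hash value of the key
    while table[i] is not None:          # collision: slot occupied
        i = (i + 1) % TABLE_SIZE
    table[i] = record
    return i
```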
43. A Quiz
Where would you be placed in
this table, if there is no
collision? Use your social
security number or some other
favorite number.
[Figure: the table with the six records inserted so far]
44. Searching for a Key
The data that's attached to a
key can be found fairly
quickly.
[Figure: the table; we are searching for key Number 701466868]
45. Searching for a Key
Calculate the hash value.
Check that location of the array for
the key.
[Figure: start at index [2] - the record there is not the target ("Not me")]
46. Searching for a Key
Keep moving forward until you
find the key, or you reach an
empty spot.
[Figure: move forward to the next spot - still not the target ("Not me")]
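The search procedure is the mirror of insertion, following the same probe sequence; a sketch (Python, my names):

```python
TABLE_SIZE = 701

def search(table, number):
    """Start at the hash value; keep moving forward until the key
    is found or an empty spot (None) is reached."""
    i = number % TABLE_SIZE
    while table[i] is not None:
        if table[i]["Number"] == number:
            return table[i]              # found the key
        i = (i + 1) % TABLE_SIZE
    return None                          # empty spot: key not in table
```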
48. Searching for a Key
Keep moving forward until you
find the key, or you reach an
empty spot.
[Figure: at the next spot the key matches ("Yes!")]
49. Searching for a Key
When the item is found, the
information can be copied to the
necessary location.
[Figure: the matching record, found by probing forward from index [2]]
50. Deleting a Record
Records may also be deleted from a hash table.
[Figure: the record with Number 506643548 asks: "Please delete me."]
51. Deleting a Record
Records may also be deleted from a hash table.
But the location must not be left as an ordinary "empty
spot" since that could interfere with searches.
[Figure: the table with Number 506643548 removed, leaving its spot empty]
52. Deleting a Record
The location must be marked in some special way so that a
search can tell that the spot used to have something in it.
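One common way to mark such a spot is a "tombstone" sentinel that search skips over but does not stop at; a sketch under that assumption (the slides describe the idea but not this code):

```python
TABLE_SIZE = 701
DELETED = object()        # marker: "this spot used to have something in it"

def delete(table, number):
    i = number % TABLE_SIZE
    while table[i] is not None:
        if table[i] is not DELETED and table[i]["Number"] == number:
            table[i] = DELETED           # not an ordinary empty spot
            return True
        i = (i + 1) % TABLE_SIZE
    return False

def search(table, number):
    i = number % TABLE_SIZE
    while table[i] is not None:          # DELETED slots do NOT stop the probe
        if table[i] is not DELETED and table[i]["Number"] == number:
            return table[i]
        i = (i + 1) % TABLE_SIZE
    return None
```

If the slot were reset to plain `None` instead, the search for Number 701466868 below would stop early at the vacated slot and wrongly report the key missing.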
53. Rossella Lau Lecture 10, DCO20105, Semester A,2005-6
Hash Table
In the previous studies, all the searches had an
efficiency of at least O(log n)
Can it be faster?
For example, if a primary key contains values from 0 to
99, then a table (array) of size 100 would be enough for
each record to be directly located by the key value
which is the subscript of the table
If we can match all key values to different slots of a
table, we can make searching for a record very efficient
Hash Table: ideally to support search time O(1)
54. Hash function and hash key
Key values may not be numeric or may be very large, but
we may transform the key into a value within a range
E.g., suppose that there are at most m (10000) records in
the file. Even if the key has 8 digits, we may use a
function, e.g., key / 10000, to transform 8-digit keys
to a value from 0-9999
Such a function, which transforms a key into a value that
can be further transformed into a subscript of an array
of fixed length, is called a hash function
The key being transformed is called the hash key
55. Perfect hash function
An ideal (perfect) hash
function transforms all
different hash keys into
different subscripts of a table
When a file has a million
records, it is difficult to find
such a function
56. Hash Value and Hash Table index
A hash function transforms a key to a value which is
called hash value
This value may need to further be transformed to a
subscript of an array: hashValue%m where m is the
table size
The value which can map to a subscript of an array is
called hash table index
57. Hash collision (clash)
When two hash keys have the same hash value,
it is called a hash collision or a hash clash
E.g., given a hash function h(key) = key and hash table
size 1000: hi(h(1322)) = 1322 % 1000 = 322 = 2322 % 1000 = hi(h(2322))
That means both key 1322 and key 2322 may attempt to
insert their records into the same position
58. Resolving hash clashes
There are two basic techniques:
1. Chaining (Open hashing): Keys with the same hash
values will be linked together and a search process
should sequentially traverse all the items in the
linked list
2. Open Addressing (Closed Hashing) : Whenever
there is a clash, it will rehash – to find another slot
in the table
many techniques: e.g., linear probing, quadratic probing
60. Open Addressing: Linear probing
Place the record in the next available position in the
array, i.e., rh(i) = i+1. E.g., with table size 10 and
h(key) = key % 10 (input: 2822, 1615, 2813, 3553,
4288, 2125, 8232):

index: 0     1     2     3     4     5     6     7     8     9
key:   -     -     2822  2813  3553  1615  2125  8232  4288  -

3553: h(3553)=3, rh(1)=4
2125: h(2125)=5, rh(1)=6
8232: h(8232)=2,
rh(1)=3, rh(2)=4,
rh(3)=5, rh(4)=6, rh(5)=7
62. Hash table re-sizing
When a hash table is full or nearly full, it requires
re-sizing to increase the size of the hash table
One of the methods is to take the first prime which is
twice as large as the old table size
For the previous table size 10, the new table size is 23
and the new hash function is h(key) = key % 23

index: 0 .. 22; occupied slots after rehashing:
[5] 1615, [7] 2813, [9] 2125, [10] 4288, [11] 3553, [16] 2822, [21] 8232
(all other slots empty)
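The re-sizing rule can be sketched as follows (Python, my names; the rehash just reapplies key % new_size to every stored key):

```python
def is_prime(n):
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def next_table_size(old_size):
    """First prime at least twice the old table size."""
    n = 2 * old_size
    while not is_prime(n):
        n += 1
    return n

def rehash_all(keys, new_size):
    """Recompute every key's slot under h(key) = key % new_size."""
    return {key: key % new_size for key in keys}
```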
63. Load Factor
To determine if a hash table is full or
nearly full, the load factor is used
The value of the load factor is the ratio
of the number of elements (m) to the number
of slots (n) of the table: m/n
64. Acceptable ranges of load factor
For different addressing methods, the load factor
has different acceptable ranges
Closed addressing (chaining): about 2 to 4 – if key
values are well distributed in the table, every linked
list is expected to have a number of nodes close to the
load factor, i.e., every hit may require at most 4 to 6
visits
Open addressing: less than about 0.7 – it is the
percentage of slots being occupied – a larger percentage
may cause a key to be rehashed many times – no longer
O(1)
65. Exercises
Ford’s 12:15.a-b++
hf(x) = x, m=11, data: 1, 13, 12, 53, 77, 29, 31,22
a) Construct the hash table by using linear probe addressing
Construct the table again by using rehash function:
index = (index + 5) % 11
b) Construct the hash table by using chaining with separate
lists; and also
Determine the load factors of the tables.
Depict the hash table after resize, the one resulting from
linear probing.
66. Hash Functions for integer data
A hash function usually produces a non-negative
value
A common hash function for numeric data is simply
hash(x) = abs(x)
Ford's: hash(x) = x^2 / 256 % 65536
67. Hash Functions for real numbers
Ford's:
hash(x) = 0 if x = 0; otherwise
hashval = abs(2 * fabs(frexp(x, &exp)) - 1);
where frexp() is a C library function used to
decompose x into two parts: a mantissa between 0.5 and
1 (returned by the function) and an exponent returned
through exp; scientific notation works like this:
x = mantissa * (2 ^ exp)
(Reference: www.cppreference.com)
iCarnegie: hash(x) = floor(m * frac(x * r)), where
typically r is the Golden Ratio (sqrt(5) - 1)/2 and
m is the table size
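Both schemes can be sketched directly in Python, whose `math.frexp` behaves like the C function (returning the mantissa instead of storing it through a pointer); the function names are mine:

```python
import math

def hash_real(x):
    """Ford-style real hash: map x to [0, 1) via its mantissa."""
    if x == 0:
        return 0.0
    mantissa, _exp = math.frexp(x)        # 0.5 <= |mantissa| < 1
    return abs(2 * abs(mantissa) - 1)     # stretched onto [0, 1)

def hash_golden(x, m):
    """iCarnegie-style: floor(m * frac(x * r)), r the golden ratio."""
    r = (math.sqrt(5) - 1) / 2
    return math.floor(m * ((x * r) % 1.0))
```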
68. Hash functions for strings
It is quite easy to think of converting each
character to its ASCII value (65-90 and 97-122) and
then accumulating the sum as the hash value – but all
permutations of a word hash to the same slot!
Better: the value of a character at each position is
multiplied by a factor, and the results summed up –
treating the string like a number
when the factor is too small, position may not be significant
when the factor is too large, the resulting value would
overflow – only the last few characters remain
accountable!
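A position-weighted string hash can be sketched like this; the factor 31 is a conventional choice (my assumption, not from the slides), and reducing mod the table size at each step avoids the overflow problem in fixed-width languages:

```python
def string_hash(s, table_size):
    """Polynomial hash: position-weighted, so "abc" and "cab" differ.
    Reducing mod table_size each step keeps the value small."""
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) % table_size
    return h
```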
69. Hash Table vs BST
Timing for searching
Ideally, hash table has the complexity of O(1) while BST has a
complexity of O(log n)
However, it may require more than O(log n) if many keys
clash to the same slot. Even with a good load factor, a hash table may
maintain optimal search time, but it takes a long time
when the hash table must be re-sized in order to maintain an
acceptable load factor
Sequential scan and range scan
The in-order traversal on a BST is a sequential scan, and range scan
is just a partial scan of the in-order traversal
Hash table does not easily support sequential scan on key values
unless the hash function maintains the order of the key values – such
a hash function may not distribute very well different key values into
different slots
70. Coalesced Hashing
Coalesced hashing is a collision resolution method that
uses pointers to connect the elements of a synonym
chain.
• A hybrid of separate chaining and open addressing.
• Linked lists within the hash table handle collisions.
• This strategy is effective, efficient and very easy to
implement.
71. Coalesced Hashing
Coalesced hashing obtains its name from what occurs when we attempt
to insert a record with a home address that is already occupied by a record
from a chain with a different home address.
This situation would occur, for example, if we attempted to insert
a record with a home address of s into the hash table.
What occurs is that the two chains with records having different
home addresses coalesce or grow together.
72. Coalesced Hashing
In the figure to the right, the records with
keys X, D, and Y were inserted in the given
order into the hash table. A, B, C, and D
form one set of synonyms and X and Y form
another set.
When X is inserted into the table with
coalescing, it must be inserted at the end of
the chain that it is coalescing with.
Instead of needing only one probe to retrieve
X, three are needed. The greater the
coalescing, the longer the probe chain will be,
and as a result, retrieval performance will be
degraded.
When record D is now added, it must be
inserted at the end of the coalesced chains;
we must move over record X from the other
chain to locate D.
[Figure: synonym chain with coalescing. The shaded portion indicates the
portion of the chain in which coalescing has occurred; the thin line
represents the insertions on the synonym chain with r as its home address;
the thick line represents the insertions on the chain with s as its home
address.]
73. Coalesced Hashing
Coalesced hashing originated with Williams [1] and is also
referred to as direct chaining.
Algorithm for Coalesced Hashing
75. Hash Tables
Hash table:
Given a table T and a record x, with key (= symbol) and
satellite data, we need to support:
• Insert (T, x)
• Delete (T, x)
• Search(T, x)
We want these to be fast, but don’t care about sorting the
records
In this discussion we consider all keys to be (possibly
large) natural numbers
76. Direct Addressing
Suppose:
The range of keys is 0..m-1
Keys are distinct
The idea:
Set up an array T[0..m-1] in which
• T[i] = x if x ∈ T and key[x] = i
• T[i] = NULL otherwise
This is called a direct-address table
• Operations take O(1) time!
77. The Problem With Direct Addressing
Direct addressing works well when the range m of
keys is relatively small
But what if the keys are 32-bit integers?
Problem 1: direct-address table will have
2^32 entries, more than 4 billion
Problem 2: even if memory is not an issue, the time to
initialize the elements to NULL may be prohibitive
Solution: map keys to a smaller range 0..m-1
This mapping is called a hash function
78. Hash Functions
Next problem: collision
[Figure: a universe of keys U, with actual keys K = {k1, ..., k5},
mapped by h into slots 0 .. m-1 of table T; k2 and k5 collide:
h(k2) = h(k5)]
79. Resolving Collisions
How can we solve the problem of collisions?
Solution 1: chaining
Solution 2: open addressing
80. Open Addressing
Basic idea
To insert: if slot is full, try another slot, …, until an open
slot is found (probing)
To search, follow same sequence of probes as would be
used when inserting the element
• If reach element with correct key, return it
• If reach a NULL pointer, element is not in table
Good for fixed sets (adding but no deletion)
Example: spell checking
Table needn’t be much bigger than n
81. Chaining
Chaining puts elements that hash to the same slot in a
linked list:
[Figure: table T of slots; chains such as k1 → k4, k5 → k2 → k3,
k8 → k6, and k7 alone; empty slots marked ——]
82. Chaining
How do we insert an element?
[Figure: the same chaining diagram as before]
83. Chaining
How do we delete an element?
Do we need a doubly-linked list for efficient delete?
[Figure: the same chaining diagram as before]
84. Chaining
How do we search for an element with a
given key?
[Figure: the same chaining diagram as before]
85. Analysis of Chaining
Assume simple uniform hashing: each key in the table is
equally likely to be hashed to any slot
Given n keys and m slots in the table, the
load factor α = n/m = average # keys per slot
What will be the average cost of an unsuccessful search
for a key? A: O(1 + α)
What will be the average cost of a successful search?
A: O(1 + α/2) = O(1 + α)
89. Analysis of Chaining Continued
So the cost of searching = O(1 + α)
If the number of keys n is proportional to the number of
slots in the table, what is α?
A: α = O(1)
In other words, we can make the expected cost of
searching constant if we make α constant
90. Choosing A Hash Function
Clearly choosing the hash function well is crucial
What will a worst-case hash function do?
What will be the time to search in this case?
What are desirable features of the hash function?
Should distribute keys uniformly into slots
Should not depend on patterns in the data
91. Hash Functions: The Division Method
h(k) = k mod m
In words: hash k into a table with m slots using the slot
given by the remainder of k divided by m
What happens to elements with adjacent
values of k?
What happens if m is a power of 2 (say 2^p)?
What if m is a power of 10?
Upshot: pick table size m = a prime number not too
close to a power of 2 (or 10)
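The trouble with m = 2^p can be seen directly: the hash keeps only the low-order bits of the key, so keys differing only in their high bits all collide. A small sketch (example keys are my own):

```python
def h_division(k, m):
    """Division method: slot = remainder of k divided by m."""
    return k % m

# With m = 256 (a power of 2), only the low byte of the key matters:
# these three keys all share the low byte 0x34 and collide.
low_bits_only = [h_division(k, 256) for k in (0x1234, 0xAB34, 0xFF34)]

# A prime table size such as 701 spreads the same keys apart.
prime_spread = [h_division(k, 701) for k in (0x1234, 0xAB34, 0xFF34)]
```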
92. Hash Functions: The Multiplication Method
For a constant A, 0 < A < 1:
h(k) = floor(m * (k*A - floor(k*A)))
What does the term (k*A - floor(k*A)) represent?
93. Hash Functions: The Multiplication Method
For a constant A, 0 < A < 1:
h(k) = floor(m * (k*A - floor(k*A)))
Choose m = 2^p
Choose A not too close to 0 or 1
Knuth: good choice for A = (sqrt(5) - 1)/2
(k*A - floor(k*A)) is the fractional part of k*A
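The method translates almost literally into code; a minimal sketch (my names) using Knuth's suggested constant:

```python
import math

def h_multiplication(k, m, A=(math.sqrt(5) - 1) / 2):
    """h(k) = floor(m * frac(k*A)); Knuth suggests A = (sqrt(5)-1)/2."""
    frac = (k * A) % 1.0          # fractional part of k*A
    return math.floor(m * frac)
```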
94. Hash Functions: Worst Case Scenario
Scenario:
You are given an assignment to implement hashing
You will self-grade in pairs, testing and grading your
partner’s implementation
In a blatant violation of the honor code, your partner:
• Analyzes your hash function
• Picks a sequence of “worst-case” keys, causing your
implementation to take O(n) time to search
What’s an honest CS student to do?
95. Hash Functions: Universal Hashing
As before, when attempting to foil a malicious
adversary: randomize the algorithm
Universal hashing: pick a hash function randomly in a
way that is independent of the keys that are actually
going to be stored
Guarantees good performance on average, no matter what
keys adversary chooses
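One classic universal family (the slides do not name a specific one, so this is an illustrative choice) picks h(k) = ((a*k + b) mod p) mod m with random a and b, where p is a prime larger than any key; a sketch:

```python
import random

P = 2_147_483_647        # the prime 2^31 - 1; assumes all keys are below P

def make_universal_hash(m, rng=random):
    """Randomly pick one member of the family, independent of the keys."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda k: ((a * k + b) % P) % m
```

Because the function is chosen after (and independently of) any adversarial key sequence, no fixed input can force worst-case O(n) behaviour on average.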
96. Variants
Many suggestions have been made for reducing the
coalescing of probe chains and thereby lowering the number
of retrieval probes, which in turn improves performance.
The variants may be classified in three ways:
• The table organization (whether or not a separate
overflow area is used).
• The manner of linking a colliding item into a chain.
• The manner of choosing unoccupied locations.
97. Variants
Coalescing may be reduced by modifying the table organization.
Instead of allocating the entire table space for both overflow records and
home address records, the table is divided into a primary area and an
overflow area.
[Figure: table split into a Primary area and an Overflow (cellar) area]
• The primary area is the address space
that the hash function maps into.
• The overflow or cellar area contains
only overflow records.
• The address factor is the ratio of
primary area to the total table size –
Address Factor = primary area / total
table size
98. Variants
For a fixed amount of storage, as the address factor
decreases, the cellar size increases, which reduces the
coalescing; but because the primary area becomes smaller, it
increases the number of collisions.
More collisions mean more items requiring multiple retrieval
probes.
Vitter [2] determined that an address factor of 0.86 yields
nearly optimal retrieval performance for most load factors.
99. Variants: LISCH
The algorithm given in slide 6 is called Late Insertion
Standard Coalesced Hashing (LISCH) since new records are
inserted at the end of a probe chain.
The 'Standard' in the name refers to the lack of a cellar.
The variant of that algorithm that uses a cellar is called
LICH, Late Insertion Coalesced Hashing.
100. Variants
Another way of varying the insertion algorithm:
changing the way in which we choose an unoccupied location.
The unoccupied locations are always chosen from the bottom of the
storage area, but the number of collisions is increased in this way.
Hsiao [3] suggests REISCH ('R' stands for 'Random'), in which a random
unoccupied location is chosen for the new insertion.
REISCH gives only a 1% improvement over EISCH (Early Insertion
Standard Coalesced Hashing).
BLISCH ('B' signifies 'Bidirectional') is another method: the selection of the
overflow location for a colliding insertion alternates between the
top and bottom of the table.
In DCWC (Direct Chaining Without Coalescing), a record not stored at its home
address is moved.
101. Variants
[Table 1 (caption only): Mean number of probes for successful lookup
(n = 997) for variants of Coalesced Hashing]