Preparation Data Structures 09 hash tables

1. Data Structures Dictionaries Andres Mendez-Vazquez May 6, 2015 1 / 127

2. Images/cinvestav- Outline 1 Introduction 2 Operation in a ADT dictionary Add Remove GetValue Contains Iterators 3 Example 4 Implementation 5 Hash Tables Number of Keys Hash Functions Hashing By Division 6 Overﬂow Handling Open Addressing Chaining 2 / 127

3. Images/cinvestav- Dictionaries Deﬁnition The ADT dictionary—also called a map, table, or associative array—contains entries that each have two parts: A keyword—usually called a search key—such as an English word or a person’s name A value—such as a deﬁnition, an address, or a telephone number—associated with that key 3 / 127

6. Images/cinvestav- Example Dictionary with Duplicates Pairs are of the form (word, meaning). More than a single meaning (bolt, a threaded pin) (bolt, a crash of thunder) (bolt, to shoot forth suddenly) (bolt, a gulp) (bolt, a standard roll of cloth) etc. 4 / 127

13. Images/cinvestav- Thus We have possibly in a dictionary Sorted keys Duplicate keys 5 / 127

14. Images/cinvestav- Operation in the ADT dictionary Common operations with most databases insert adds a new entry to the dictionary, given a search key and associated value. delete removes an entry, given its associated search key retrieve retrieves the value associated with a given search key search sees whether the dictionary contains a given search key traverse It traverse all the search keys in the dictionary It traverse all the values in the dictionary 6 / 127

21. Images/cinvestav- In addition We have the following extra operations Detect whether a dictionary is empty Get the number of entries in the dictionary Remove all entries from the dictionary 7 / 127

25. Images/cinvestav- Speciﬁcations: Add Pseudocode add(key, value) Task It adds the pair (key , value) to the dictionary. Input and Output Input: key is an object search key, value is an associated object. Output: None. 9 / 127

28. Images/cinvestav- Example Adding something h(k) Hash Table (k,item) 10 / 127

30. Images/cinvestav- Speciﬁcations: Remove Pseudocode remove(key) Task It removes from the dictionary the entry that corresponds to a given search key. Input and Output Input: key is an object search key. Output: Returns either the value that was associated with the search key or null if no such object exists. 12 / 127

33. Images/cinvestav- Example Removing something h(k) Hash Table (k,item) 13 / 127

35. Images/cinvestav- Speciﬁcations: GetValue Pseudocode getValue(key) Task It retrieves from the dictionary the value that corresponds to a given search key. Input and Output Input: key is an object search key. Output: Returns either the value associated with the search key or null if no such object exists. 15 / 127

39. Images/cinvestav- Speciﬁcations: Contains Pseudocode contains(key) Task It sees whether any entry in the dictionary has a given search key. Input and Output Input: key is an object search key. Output: Returns true if an entry in the dictionary has key as its search key. 17 / 127

43. Images/cinvestav- Speciﬁcations: GetKeyIterator Pseudocode getKeyIterator() Task It creates an iterator that traverses all search keys in the dictionary. Input and Output Input: None. Output: Returns an iterator that provides sequential access to the search keys in the dictionary. 19 / 127

46. Images/cinvestav- Speciﬁcations: GetValueIterator Pseudocode getValueIterator() Task It creates an iterator that traverses all values in the dictionary. Input and Output Input: None. Output: Returns an iterator that provides sequential access to the values in the dictionary. 20 / 127

49. Images/cinvestav- Other Operations isEmpty() It sees whether the dictionary is empty. getSize() It gets the size of the dictionary. clear() It removes all entries from the dictionary. 21 / 127

52. Images/cinvestav- However, we have two scenarios Distinct search keys Case 1 You can refuse to add another key-value. Case 2 You can change the existing value associated with key to the new value. Then, you return the old value Duplicate search keys if the method add adds every given key-value entry to a dictionary The methods remove and getValue must deal with multiple entries that have the same search key. What do you remove or return!!! 22 / 127

57. Images/cinvestav- Interface We have the following interface import java . u t i l . I t e r a t o r ; p u b l i c i n t e r f a c e D i c t i o n a r y I n t e r f a c e <Key , Value> { p u b l i c Value add ( Key k , Value Item ) ; p u b l i c Value remove ( Key k ) ; p u b l i c Value getValue ( Key k ) ; p u b l i c boolean c o n t a i n s ( Key k ) ; p u b l i c I t e r a t o r <Key> g e t K e y I t e r a t o r ( ) ; p u b l i c I t e r a t o r <Value> g e t V a l u e I t e r a t o r ( ) ; p u b l i c boolean isEmpty ( ) ; p u b l i c i n t g e t S i z e ( ) ; p u b l i c void c l e a r ( ) ; } 23 / 127

58. Images/cinvestav- Where we can use this ADT? In the phone directory problem It is a directory that uses a name as the key and adds and returns a phone number For example Name Number Suzanne Nouveaux 401-555-1234 Andres Mendez-Vazquez 301-123-2345 ... ... 24 / 127

59. Images/cinvestav- Where we can use this ADT? In the phone directory problem It is a directory that uses a name as the key and adds and returns a phone number For example Name Number Suzanne Nouveaux 401-555-1234 Andres Mendez-Vazquez 301-123-2345 ... ... 24 / 127

60. Images/cinvestav- Thus, we have the following diagram A class Diagram TelephoneDirectory readFile(data) getPhoneNumber(name) PhoneDirectory Dictionary Name String 1 1 * * * * 25 / 127

61. Images/cinvestav- Basic Code We have something like this p u b l i c c l a s s TelephoneDirectory { p r i v a t e D i c t i o n a r y I n t e r f a c e <Name , String > phoneBook ; p u b l i c TelephoneDirectory () { phoneBook = new SortedDictionary <Name , String >(); } // end d e f a u l t c o n s t r u c t o r /∗∗ Reads a t e x t f i l e of names and telephone numbers∗∗/ p u b l i c void r e a d F i l e ( Scanner data ) { . . .} /∗∗ Gets the phone number of a given person . ∗/ p u b l i c S t r i n g getPhoneNumber (Name personName ) { . . . } // end getPhoneNumber } // end TelephoneDirectory 26 / 127

62. Images/cinvestav- Now, the Big Question It is a big one How do we implement this data structure? Possible ways Linear List Skip List Hash Tables ... 27 / 127

67. Images/cinvestav- First: Represent As A Linear List You have L = (e0, e1, ..., en−1) Where Each ei is a pair (key, element). We can use the following representations Array or linked representation. 28 / 127

70. Images/cinvestav- Array Representation Our classic array b c d e a We have then Operation in Array Representation Complexity getValue(theKey) O(size) add(theKey, theItem) O(size) to ﬁnd duplicate O(1) to add at right end remove(theKey) O(size) 29 / 127

71. Images/cinvestav- Array Representation Our classic array b c d e a We have then Operation in Array Representation Complexity getValue(theKey) O(size) add(theKey, theItem) O(size) to ﬁnd duplicate O(1) to add at right end remove(theKey) O(size) 29 / 127

72. Images/cinvestav- What if we sort the array? Sorted Keys b c d ea We have then Operation in Array Representation Complexity getValue(theKey) O(logsize) add(theKey, theItem) O(logsize) to ﬁnd duplicate O(size) to add remove(theKey) O(size) 30 / 127

73. Images/cinvestav- What if we sort the array? Sorted Keys b c d ea We have then Operation in Array Representation Complexity getValue(theKey) O(logsize) add(theKey, theItem) O(logsize) to ﬁnd duplicate O(size) to add remove(theKey) O(size) 30 / 127

74. Images/cinvestav- Unsorted Chain Our Structure b c d e a Complexity Operation in Chain Representation Complexity getValue(theKey) O(size) add(theKey, theItem) O(size) to ﬁnd duplicate O(1) to add remove(theKey) O(size) 31 / 127

75. Images/cinvestav- Unsorted Chain Our Structure b c d e a Complexity Operation in Chain Representation Complexity getValue(theKey) O(size) add(theKey, theItem) O(size) to ﬁnd duplicate O(1) to add remove(theKey) O(size) 31 / 127

76. Images/cinvestav- Sorted Chain Our Structure b c d ea Complexity Operation in Chain Representation Complexity getValue(theKey) O(size) add(theKey, theItem) O(size) to ﬁnd duplicate O(1) to add remove(theKey) O(size) 32 / 127

77. Images/cinvestav- Sorted Chain Our Structure b c d ea Complexity Operation in Chain Representation Complexity getValue(theKey) O(size) add(theKey, theItem) O(size) to ﬁnd duplicate O(1) to add remove(theKey) O(size) 32 / 127

78. Images/cinvestav- Skip Lists: we will skip it - It is for an advance class of analysis of algorithms We have something like this Complexity Operation Complexity - Worst Case Complexity - Expected getValue(theKey) O(size) O(log size) add(theKey, theItem) O(size) O(logsize) remove(theKey) O(size) O(logsize) 33 / 127

79. Images/cinvestav- Skip Lists: we will skip it - It is for an advance class of analysis of algorithms We have something like this Complexity Operation Complexity - Worst Case Complexity - Expected getValue(theKey) O(size) O(log size) add(theKey, theItem) O(size) O(logsize) remove(theKey) O(size) O(logsize) 33 / 127

80. Images/cinvestav- We will concentrate our efforts in the Hash Tables Definition A hash table or hash map T is a data structure, most commonly an array, that uses a hash function to efficiently map certain identifiers of keys (e.g. person names) to associated values. Why? Operation in Array Representation Complexity - Worst Case Complexity - Expected getValue(theKey) O(size) O (1 + C) add(theKey, theItem) O(size) O (1 + C) remove(theKey) O(size) O (1 + C) 34 / 127

81. Images/cinvestav- We will concentrate our efforts in the Hash Tables Definition A hash table or hash map T is a data structure, most commonly an array, that uses a hash function to efficiently map certain identifiers of keys (e.g. person names) to associated values. Why? Operation in Array Representation Complexity - Worst Case Complexity - Expected getValue(theKey) O(size) O (1 + C) add(theKey, theItem) O(size) O (1 + C) remove(theKey) O(size) O (1 + C) 34 / 127

82. Images/cinvestav- Then Advantages They have the advantage of having a expected complexity of operations of O(1 + C) Still, be aware of C because this will change depending on which overﬂow policy you use... 35 / 127

83. Images/cinvestav- Then Advantages They have the advantage of having a expected complexity of operations of O(1 + C) Still, be aware of C because this will change depending on which overﬂow policy you use... 35 / 127

85. Images/cinvestav- You have two cases for this data structure First Small universe of keys. Second Large number of keys 37 / 127

86. Images/cinvestav- You have two cases for this data structure First Small universe of keys. Second Large number of keys 37 / 127

87. Images/cinvestav- When you have a small universe of keys, U We can do the following Key values are direct addresses in the array. Direct implementation or Direct-address tables. Operations 1 Direct-Address-Search(Table, key) return Table[key] 2 Direct-Address-Search(Table,key,value) Table[key]=value 3 Direct-Address-Delete(T, x) Table[key]=null 38 / 127

95. Images/cinvestav- When you have a large universe of keys, U Then Then, it is impractical to store a table of the size of |U|. You can use a especial function for mapping h : U→{0, 1, ..., m − 1} (1) 39 / 127

96. Images/cinvestav- When you have a large universe of keys, U Then Then, it is impractical to store a table of the size of |U|. You can use a especial function for mapping h : U→{0, 1, ..., m − 1} (1) 39 / 127

97. Images/cinvestav- Example Imagine that you have A 1D array (or table) table[0 : m − 1]. Thus h(k)is the home bucket for key k. Then Every dictionary pair (key, Item) is stored in its home bucket table[h[key]]. 40 / 127

100. Images/cinvestav- Example Push the following pairs in a hash table of size m = 8 (22,a), (33,c), (3,d), (73,e), (85,f). Hash function is key/11 Then, we have that 41 / 127

103. Images/cinvestav- What Can Go Wrong? What if we add? Where does (26,g) go? Then PROBLEM!!! Keys that have the same home bucket are synonyms. 22 and 26 are synonyms with respect to the hash function that is in use. This is known as collision or overﬂow. 42 / 127

107. Images/cinvestav- Collisions This is a problem We might try to avoid this by using a suitable hash function h. Idea Make appear to be “random” enough to avoid collisions altogether (Highly Improbable) or to minimize the probability of them. You still have the problem of collisions Possible Solutions to the problem: 1 Chaining 2 Open Addressing 43 / 127

112. Images/cinvestav- Other Issues First Issue The choice of the possible hash function. Second The collision handling method Third The size (number of buckets) at the hash table 44 / 127

116. Images/cinvestav- Hash Functions They have two parts 1 The conversion of the key into an integer in the case the key is not an integer. 2 The mapping to the home bucket. 46 / 127

119. Images/cinvestav- String To Integer Something Notable Each Java character is 2 bytes long. An int is 4 bytes long. We could have the following string s = “pt” We then do the following: 1 int answer = s.charAt(0); 2 answer = (answer<<16)+s.charAt(1) ; 47 / 127

124. Images/cinvestav- What about longer Strings? We can do the following process p u b l i c s t a t i c i n t i n t e g e r ( S t r i n g s ) { i n t lengt h = s . l ength ( ) ; // number of c h a r a c t e r s i n s i n t answer = 0; i f ( len gth % 2 == 1) {// lengt h i s odd answer = s . charAt ( le ngth − 1 ) ; length −−; } 48 / 127

125. Images/cinvestav- Cont... The rest of the code // lengt h i s now even f o r ( i n t i = 0; i < lengt h ; i += 2) {// do two c h a r a c t e r s at a time answer += s . charAt ( i ) ; answer += (( i n t ) s . charAt ( i + 1)) << 16; } r e t u r n ( answer < 0) ? −answer : answer ; } 49 / 127

126. Images/cinvestav- Analysis of hashing: Which hash function? Consider that: Good hash functions should maintain the property of simple uniform hashing! The keys have the same probability 1/m to be hashed to any bucket!!! A uniform hash function minimizes the likelihood of an overﬂow when keys are selected at random. 50 / 127

129. Images/cinvestav- Possible hash function when the keys are natural numbers The division method h(k) = k mod m. Good choices for m are primes not too close to a power of 2. 51 / 127

130. Images/cinvestav- Possible hash function when the keys are natural numbers The division method h(k) = k mod m. Good choices for m are primes not too close to a power of 2. 51 / 127

131. Images/cinvestav- What if... Question: What about something with keys in a normal distribution? 52 / 127

133. Images/cinvestav- Hashing By Division Universe of keys keySpace = all integers. Thus, we have that For every m, the number of integers that get mapped (hashed) into bucket i is approximately 232 /m. Properties The division method results in a uniform hash function when keySpace = all integers. 54 / 127

136. Images/cinvestav- However Problem In practice, keys tend to be correlated. Thus The choice of the divisor b aﬀects the distribution of home buckets. Then Because of this correlation, applications tend to have a bias towards keys that map into odd integers (or into even ones). 55 / 127

139. Images/cinvestav- Examples Odd number and m an even number Odd integers hash into odd home buckets 15%14 = 1, 3%14 = 3, 23%14 = 9 Even number and m an even number Even integers into even home buckets. 20%14 = 6, 30%14 = 2, 8%14 = 8 Properties The bias in the keys results in a bias toward either the odd or even home buckets. 56 / 127

142. Images/cinvestav- What if we use an Odd number? Odd number and m an odd number odd integers may hash into any home. 15%15 = 0, 3%15 = 3, 23%15 = 8 Even number and m an odd number Even integers may hash into any home. 20%15 = 5, 30%15 = 0, 8%15 = 8 Thus The bias in the keys does not result in a bias toward either the odd or even home buckets. 57 / 127

145. Images/cinvestav- Bias Something Notable The bias in the keys does not result in a bias toward either the odd or even home buckets. Then, we have We have a better chance of uniformly distributed home buckets. Thus So do not use an even divisor. 58 / 127

148. Images/cinvestav- Selecting The Divisor Another Problem Similar biased distribution of home buckets is seen, in practice, when the divisor is a multiple of prime numbers such as 3, 5, 7, . . . However The eﬀect of each prime divisor p of m decreases as p gets larger. Rules of Choosing m Ideally, choose m so that it is a prime number. Not to close to a power of 2. 59 / 127

151. Images/cinvestav- However Something Notable Even with this hash function, we can have problems Remember The Gaussian Keys... 60 / 127

152. Images/cinvestav- However Something Notable Even with this hash function, we can have problems Remember The Gaussian Keys... 60 / 127

153. Images/cinvestav- Java.util.HashTable Quite simple Simply uses a divisor that is an odd number. Simplify the implementation This simpliﬁes implementation because we must be able to resize the hash table as more pairs are put into the dictionary. Why? Array doubling, for example, requires you to go from a 1D array table whose length is m (which is odd) to an array whose length is 2m + 1 (which is also odd). 61 / 127

156. Images/cinvestav- A Better Solution: Universal Hashing Issues In practice, keys are not randomly distributed. Any ﬁxed hash function might yield retrieval O(n) time. Goal To ﬁnd hash functions that produce uniform random table indexes irrespective of the keys. Idea To select a hash function at random from a designed class of functions at the beginning of the execution. 62 / 127

159. Images/cinvestav- Hashing methods: Universal hashing Example Set of hash functions Choose a hash function randomly (At the beginning of the execution) HASH TABLE 63 / 127

160. Images/cinvestav- Example of Universal Hash Proceed as follows: Choose a primer number p large enough so that every possible key k is in the range [0, ..., p − 1] Zp = {0, 1, ..., p − 1}and Z∗ p = {1, ..., p − 1} Deﬁne the following hash function: ha,b(k) = ((ak + b) mod p) mod m, ∀a ∈ Z∗ p and b ∈ Zp The family of all such hash functions is: Hp,m = {ha,b : a ∈ Z∗ p and b ∈ Zp} Important a and b are chosen randomly at the beginning of execution. The class Hp,m of hash functions is universal. 64 / 127

168. Images/cinvestav- Example: Universal hash functions Example p = 977, m = 50, a and b random numbers ha,b(k) = ((ak + b) mod p) mod m 65 / 127

169. Images/cinvestav- Example of key distribution Example, mean = 488.5 and dispersion = 5 66 / 127

170. Images/cinvestav- Example with 10 keys Universal Hashing Vs Division Method 67 / 127

174. Images/cinvestav- Another Example: Matrix Method Then Let us say keys are u-bits long. Say the table size M is power of 2. an index is b-bits long with M = 2b. The h function Pick h to be a random b-by-u 0/1 matrix, and deﬁne h(x) = hx where after the inner product we apply mod 2 Example h b    1 0 0 0 0 1 1 1 1 1 1 0    u x     1 0 1 0      = h (x)   1 1 0    71 / 127

179. Images/cinvestav- Implementation of the column*vector mod 2 Code - SWAR-Popcount - Divide and Conquer // This works only i n 32 b i t s i n t product ( i n t row , i n t v e c t o r ){ i n t i = row & v e c t o r ; i = i − (( i >> 1) & 0x55555555 ) ; i = ( i & 0x33333333 ) + (( i >> 2) & 0x33333333 ) ; i = ( ( ( i + ( i >> 4)) & 0x0F0F0F0F ) ∗ 0x01010101 ) >> 24; r e t u r n i & 0x00000001 ; } 72 / 127

180. Images/cinvestav- Explanation Counting Duo-Bits A bit-duo (two neighboring bits) can be interpreted with bit 0 = a, and bit 1 = b as duo := 2b + a The duo population is popcnt(duo) := b + a This can be achieved by (2b + a) − (2b + a) ÷ 2 or (2b + a) − b 73 / 127

183. Images/cinvestav- We have then Something Notable i i div 2 popcnt(i) 00 00 00 01 00 01 10 01 01 11 01 10 We have then Only the lower bit is needed from i div 2 - and one do not has to worry about borrows from neighboring duos. We need to clear the upper bits Remember 5 in binary 0101 (Granularity Problem)... we use the 0 to remove the upper bits 74 / 127

186. Images/cinvestav- Thus, we have that Clearing Bits SWAR-wise, one needs to clear all "even" bits of the div 2 subtrahend to perform a 32-bit subtraction of all 16 duos: x = x - ((x >> 1) & 0x55555555); Note The popcount-result of the bit-duos still takes two bits. Now What? 75 / 127

189. Images/cinvestav- Counting Nibble-Bits We have The next step is to add the duo-counts to populations of four neighboring bits, the 8 nibble-counts, which may range from zero to four We can do this using 3 in binary 0x3=0011 76 / 127

190. Images/cinvestav- Counting Nibble-Bits We have The next step is to add the duo-counts to populations of four neighboring bits, the 8 nibble-counts, which may range from zero to four We can do this using 3 in binary 0x3=0011 76 / 127

191. Images/cinvestav- How? How many ones you can have in 4 bits 0 to 22 Thus, we only need 4 bits to store the result We create a counter for each of the two bits in four bits This is done using the mask - after How many ones can you have in two bits? From 0 to 2 you can use the lower 2 bits to represent this!!! This is the instruction i = (i & 0x33333333) + ((i >> 2) & 0x33333333); 77 / 127

196. Images/cinvestav- Example We have Old i new i 3 new i&3 0000 0000 0011 0000 0001 0001 0011 0001 0010 0001 0011 0001 0011 0010 0011 0010 0100 0100 0011 0000 0101 0101 0011 0001 1111 1010 0011 0010 new i>>2 3 new i>>2&3 0000 0011 0000 0000 0011 0000 0000 0011 0000 0000 0011 0000 0001 0011 0001 0001 0011 0001 0010 0011 0010 Final i 0000 0001 0001 0010 0001 0010 0100 78 / 127

197. Images/cinvestav- We have the following Now what? You already got the idea? Now it is about to get the byte-populations from two nibble-populations (8 bits) Do you remember your hexadecimal notation 0x0f = 00001111 How many ones do you have in 8 positions From 0 to 23, you only require 4 bits for this!!! 79 / 127

200. Images/cinvestav- Thus the next step It will get the counter for the number of ones in 8 bits (i + (i >> 4)) & 0x0f0f0f0f0f0f0f0f; 80 / 127

201. Images/cinvestav- Example We have Old x 2 bits 4 bits 11110000 10100000 01000000 11111111 10101010 01000100 Now we get 1 01000000 −→ (i >> 4) → 00000100 −→ i+(i >> 4) →01000100&00001111−→ 00000100 2 01000100 −→ (i >> 4) → 00000100 −→ i+(i >> 4) →01001000&00001111−→ 00001000 81 / 127

202. Images/cinvestav- Example We have Old x 2 bits 4 bits 11110000 10100000 01000000 11111111 10101010 01000100 Now we get 1 01000000 −→ (i >> 4) → 00000100 −→ i+(i >> 4) →01000100&00001111−→ 00000100 2 01000100 −→ (i >> 4) → 00000100 −→ i+(i >> 4) →01001000&00001111−→ 00001000 81 / 127

203. Images/cinvestav- We can keep doing this It is possible to keep going i = (i + (i >> 8)) & 0x00ff00ff Then i = (i + (i >> 16)) & 0x000000ff 82 / 127

204. Images/cinvestav- We can keep doing this It is possible to keep going i = (i + (i >> 8)) & 0x00ff00ff Then i = (i + (i >> 16)) & 0x000000ff 82 / 127

205. Images/cinvestav- With today fast 64 bit multiplications Something Notable one can multiply the vector of 8-byte-counts with 0x0101010101010101 to get the ﬁnal result in the most signiﬁcant byte, For this the code (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101)>>24 83 / 127

206. Images/cinvestav- With today fast 64 bit multiplications Something Notable one can multiply the vector of 8-byte-counts with 0x0101010101010101 to get the ﬁnal result in the most signiﬁcant byte, For this the code (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101)>>24 83 / 127

207. Images/cinvestav- Divide and Conquer This is the natural way we do many things We always attack smaller versions ﬁrst of the large one!!! 84 / 127

208. Images/cinvestav- Overflow Handling Overflow An overflow occurs when the home bucket for a new pair (key, element) is full. One strategy to handle overflow, small universe of keys Search the hash table in some systematic fashion for a bucket that is not full. Linear probing (linear open addressing). Quadratic probing. Random probing. The other strategy, a large universe of keys Eliminate overflows by permitting each bucket to keep a list of all pairs for which it is the home bucket. Array linear list. Chain. 85 / 127

215. Images/cinvestav- Open addressing Deﬁnition All the elements occupy the hash table itself. What is it? We systematically examine table slots until either we ﬁnd the desired element or we have ascertained that the element is not in the table. Advantages The advantage of open addressing is that it avoids pointers altogether. 87 / 127

218. Images/cinvestav- Insert in Open addressing Extended hash function to probe Instead of being ﬁxed in the order 0, 1, 2, ..., m − 1 with Θ (n) search time Extend the hash function to h : U × {0, 1, ..., m − 1} → {0, 1, ..., m − 1} This gives the probe sequence h(k, 0), h(k, 1), ..., h(k, m − 1) A permutation of 0, 1, 2, ..., m − 1 88 / 127

222. Images/cinvestav- Hashing methods in Open Addressing HASH-INSERT(T, k) 1 i = 0 2 repeat 3 j = h (k, i) 4 if T [j] == NIL 5 T [j] = k 6 return j 7 else i = i + 1 8 until i == m 9 error “Hash Table Overﬂow” 89 / 127

223. Images/cinvestav- Hashing methods in Open Addressing HASH-SEARCH(T,k) 1 i = 0 2 repeat 3 j = h (k, i) 4 if T [j] == k 5 return j 6 i = i + 1 7 until T [j] == NIL or i == m 8 return NIL 90 / 127

224. Images/cinvestav- Linear probing: Deﬁnition and properties Hash function Given an ordinary hash function h : 0, 1, ..., m − 1 → U for i = 0, 1, ..., m − 1, we get the extended hash function h(k, i) = h (k) + i mod m, (2) Sequence of probes Given key k, we ﬁrst probe T[h (k)], then T[h (k) + 1] and so on until T[m − 1]. Then, we wrap around T[0] to T[h (k) − 1]. Distinct probes Because the initial probe determines the entire probe sequence, there are m distinct probe sequences. 91 / 127

227. Images/cinvestav- Linear probing: Definition and properties Disadvantages Linear probing suffers of primary clustering. Long runs of occupied slots build up increasing the average search time. Clusters arise because an empty slot preceded by i full slots gets filled next with probability i+1 m . Long runs of occupied slots tend to get longer, and the average search time increases. 92 / 127

231. Images/cinvestav- Example: Linear Probing – Get And Put Constraints divisor = m (number of buckets) = 17. Home bucket = key % 17. Then Put in pairs whose keys are 6, 12, 34, 29, 28, 11, 23, 7, 0, 33, 30, 45 We have 93 / 127

234. Images/cinvestav- Linear Probing – Remove Example remove(0) Compact Cluster Search cluster for pair (if any) to ﬁll vacated bucket. 94 / 127

237. Images/cinvestav- Linear Probing – remove(34) Example remove(34) Compact Cluster Search cluster for pair (if any) to ﬁll vacated bucket. 95 / 127

240. Images/cinvestav- Linear Probing – remove(29) Example Compact Cluster Search cluster for pair (if any) to ﬁll vacated bucket. 96 / 127

241. Images/cinvestav- Linear Probing – remove(29) Example Compact Cluster Search cluster for pair (if any) to ﬁll vacated bucket. 96 / 127

242. Images/cinvestav- Code for Removing We have the following p u b l i c void remove ( key ){ i n t positionsChecked = 1; i n t i = FindSlot ( Key ) ; i f ( Table [ i ] == n u l l ) r e t u r n ; // key i s not i n the t a b l e j = i ; wh il e ( positionsChecked <= Table . len gth ){ j = ( j +1) % Table . leng th ; i f ( Table [ j ] == n u l l ) break ; k = Hashing ( Table [ j ] . key ) ; i f ( i < j && ( k <= i | | k > j )) | | ( j < i && ( k <= i && k > j )) { Table [ i ] = Table [ j ] ; i = j ; } positionChecked++; } Table [ i ] = n u l l ; } 97 / 127

243. Images/cinvestav- Explanation First For all records in a cluster, there must be no vacant slots between their natural hash position and their current position (else lookups will terminate before ﬁnding the record). Second k is the raw hash where the record at j would naturally land in the hash table if there were no collisions. Thus This test is asking if the record at j is invalidly positioned with respect to the required properties of a cluster now that i is vacant. 98 / 127

246. Images/cinvestav- Case 1 We have the following Case 1 i j k k we have i < j If i < k ≤ j then moving j to the i position will be incorrect... Why? 99 / 127

247. Images/cinvestav- Case 2 We have the following Case 2 ij k We have j < i If k ≤ j < or i < k then moving j to the i position will be incorrect... Why? 100 / 127

248. Images/cinvestav- Complexity We have then Worst-case get/put/remove time is Θ(n), where n is the number of pairs in the table. Something Notable This happens when all pairs are in the same cluster. 101 / 127

249. Images/cinvestav- Complexity We have then Worst-case get/put/remove time is Θ(n), where n is the number of pairs in the table. Something Notable This happens when all pairs are in the same cluster. 101 / 127

250. Images/cinvestav- Explaining α We have that α= loading density = (number of pairs)/m. In the example α = 12/17 We have the following terms to explain complexity in Open Addressing Sn = expected number of buckets examined in a successful search when n is large. Un= expected number of buckets examined in a unsuccessful search when n is large 102 / 127

254. Images/cinvestav- Expected Performance in Open Addressing, 0 ≤ α ≤ 1 We have for unsuccessful search Un = O 1 + 1 1 − α (3) We have for successful search Sn = O 1 + 1 α ln 1 1 − α (4) 103 / 127

255. Images/cinvestav- Expected Performance in Open Addressing, 0 ≤ α ≤ 1 We have for unsuccessful search Un = O 1 + 1 1 − α (3) We have for successful search Sn = O 1 + 1 α ln 1 1 − α (4) 103 / 127

256. Images/cinvestav- Thus, for diﬀerent α s We have using uniform keys!!! α Sn Un 0.50 1.5 2.5 0.75 2.5 8.5 0.90 5.5 50.5 Recommendation α ≤ 0.75 is recommended. 104 / 127

257. Images/cinvestav- Thus, for diﬀerent α s We have using uniform keys!!! α Sn Un 0.50 1.5 2.5 0.75 2.5 8.5 0.90 5.5 50.5 Recommendation α ≤ 0.75 is recommended. 104 / 127

258. Images/cinvestav- Hash Table Design For example If performance requirements are given, determine maximum permissible loading density. We want a successful search to make no more than 5 compares (expected) 4 = 1 + 1 1−α‘ α = 3/4 We want an unsuccessful search to make no more than 6 compares (expected). 6 = 1 + 1 α log 1 1−α α ≈ 0.964 105 / 127

261. Images/cinvestav- Use the minimum α for your trigger of doubling the array!!! Thus, we have αf ≤ min {3/4, 0.964} 106 / 127

262. Images/cinvestav- Hash Table Design Dynamic resizing of table. Whenever loading density exceeds threshold, rehash into a table of approximately twice the current size. 107 / 127

263. Images/cinvestav- But , even with a nice division method!!! Example using keys uniformly distributed It was generated using the division method Then 108 / 127

264. Images/cinvestav- But , even with a nice division method!!! Example using keys uniformly distributed It was generated using the division method Then 108 / 127

265. Images/cinvestav- Example Example using Gaussian keys It was generated using the division method Then 109 / 127

266. Images/cinvestav- Example Example using Gaussian keys It was generated using the division method Then 109 / 127

268. Images/cinvestav- Linear List Of Synonyms Thus Each bucket keeps a linear list of all pairs for which it is the home bucket. The linear list may or may not be sorted by key. The linear list may be an array linear list or a chain. 111 / 127

269. Images/cinvestav- Collision Handling: Chaining A Possible Solution Insert the elements that hash to the same slot into a linked list. U (Universe of Keys) 112 / 127

270. Images/cinvestav- Example Sorted Chains Add to a hash table with m = 11 Put in pairs whose keys are 6, 17, 12, 23, 28, 5, 16, 3, 8 So, we have Home bucket = key % 11. 113 / 127

271. Images/cinvestav- Example Sorted Chains Add to a hash table with m = 11 Put in pairs whose keys are 6, 17, 12, 23, 28, 5, 16, 3, 8 So, we have Home bucket = key % 11. 113 / 127

272. Images/cinvestav- Example The Table 0 1 2 3 4 5 6 7 8 9 10 6, 17, 12, 23, 28, 5, 16, 3, 8 114 / 127

273. Images/cinvestav- Example The Table 0 1 2 3 4 5 6 7 8 9 10 6, 17, 12, 23, 28, 5, 16, 3 8 115 / 127

274. Images/cinvestav- Example The Table 0 1 2 3 4 5 6 7 8 9 10 6, 17, 12, 23, 28, 5, 16 8 3 116 / 127

275. Images/cinvestav- Example The Table 0 1 2 3 4 5 6 7 8 9 10 6, 17, 12, 23, 28, 5 8 3 16 117 / 127

276. Images/cinvestav- Example The Table 0 1 2 3 4 5 6 7 8 9 10 6, 17, 12, 23, 28, 5 8 3 16 118 / 127

277. Images/cinvestav- Example The Table 0 1 2 3 4 5 6 7 8 9 10 6, 17, 12, 23, 28 8 3 5 16 119 / 127

278. Images/cinvestav- Example The Table 0 1 2 3 4 5 6 7 8 9 10 6, 17, 12, 23 8 3 5 16 28 120 / 127

279. Images/cinvestav- Example The Table 0 1 2 3 4 5 6 7 8 9 10 6, 17, 12 8 3 5 16 28 23 121 / 127

280. Images/cinvestav- Example The Table 0 1 2 3 4 5 6 7 8 9 10 6, 17 8 3 5 16 28 12 23 122 / 127

281. Images/cinvestav- Example The Table 0 1 2 3 4 5 6 7 8 9 10 6 8 3 5 16 17 12 23 28 123 / 127

282. Images/cinvestav- Example The Table 0 1 2 3 4 5 6 7 8 9 10 8 3 5 16 17 12 23 286 124 / 127

283. Images/cinvestav- Do You Remember This? Universal Hashing Vs Division Method 125 / 127

284. Images/cinvestav- Expected Complexity of Hash Table under Chaining We have for unsuccessful search Un = O (1 + α) (5) We have for successful search Sn = O (1 + α) (6) 126 / 127

285. Images/cinvestav- Expected Complexity of Hash Table under Chaining We have for unsuccessful search Un = O (1 + α) (5) We have for successful search Sn = O (1 + α) (6) 126 / 127

286. Images/cinvestav- Now, Back to the real world java.util.Hashtable It uses unsorted chains. It uses a default initial m = divisor = 101 It uses a default α ≤ 0.75 When loading density exceeds a max permissible threshold, It rehash with new m = 2m+1. 127 / 127

Preparation Data Structures 09 hash tables

Recommended

Recommended

More Related Content

What's hot

What's hot (19)

Viewers also liked

Viewers also liked (6)

Similar to Preparation Data Structures 09 hash tables

Similar to Preparation Data Structures 09 hash tables (20)

More from Andres Mendez-Vazquez

More from Andres Mendez-Vazquez (20)

Recently uploaded

Recently uploaded (20)

Preparation Data Structures 09 hash tables