6.01.hash tableintro


Published in: Technology

  1. 1. Introduction to Hash Tables
  2. 2. Outline <ul><li>Discuss storing unordered data </li></ul><ul><li>Discuss IP addresses and domain names </li></ul><ul><li>Consider conversions between these two forms </li></ul><ul><li>Introduce the idea of hashing: </li></ul><ul><ul><li>Reducing O(ln(n)) operations to O(1) </li></ul></ul><ul><li>Consider some of the weaknesses </li></ul>
  3. 3. Background <ul><li>Given a set of data, suppose each data item is associated with a unique integer key in a particular range: </li></ul><ul><ul><li>UW Student ID Numbers (decimal) </li></ul></ul><ul><ul><li>Social Insurance Numbers (decimal) </li></ul></ul><ul><ul><li>IP Addresses (binary) </li></ul></ul><ul><li>We will use IP addresses as a model </li></ul><ul><li>Databases often assign unique primary keys by using an automatically incremented counter </li></ul>
  4. 4. IP Addresses <ul><li>Each computer communicating on a network using the Internet Protocol Version 4 (IPv4) has a unique IP address </li></ul><ul><ul><li>32 bits (4 bytes), allowing about four billion addresses </li></ul></ul><ul><li>Represented in human-readable format as four decimal bytes </li></ul><ul><ul><li>The ECE web server has the IP address 129.97.56.100 </li></ul></ul><ul><ul><li>The URL is valid </li></ul></ul><ul><ul><li>IP addresses are not easy to remember </li></ul></ul>
  5. 5. IP Addresses <ul><li>Domain names were introduced for humans </li></ul><ul><li>The Domain Name System (DNS) is hierarchical </li></ul><ul><ul><li>There are a limited number of top-level domains </li></ul></ul><ul><ul><li>Countries use ISO 3166 country codes </li></ul></ul><ul><ul><li>Responsibility for 2nd-, 3rd-, and lower-level domains lies with the parent domain </li></ul></ul>
  6. 6. IP Addresses <ul><li>The Unix command host translates between IP addresses and domain names </li></ul><ul><li>$ host </li></ul><ul><li> has address </li></ul><ul><li>$ host </li></ul><ul><li> has address </li></ul><ul><li>$ host </li></ul><ul><li> is an alias for </li></ul><ul><li> has address </li></ul><ul><li>$ host </li></ul><ul><li> is an alias for </li></ul><ul><li> is an alias for </li></ul><ul><li> has address </li></ul><ul><li> has address </li></ul><ul><li> has address </li></ul><ul><li> has address </li></ul>
  7. 7. IP Addresses <ul><li>The mapping is not one-to-one </li></ul><ul><ul><li>Some IP addresses have multiple domain names </li></ul></ul><ul><ul><li>Some domain names have multiple IP addresses </li></ul></ul><ul><li>This allows flexibility: </li></ul>
  8. 8. IP Addresses <ul><li>DNS allows a division of effort in name translation </li></ul><ul><ul><li>A DNS server in Korea ( kr ) does not need to know the IP address of </li></ul></ul><ul><li>Similarly, the University of Waterloo has control of names within its domain </li></ul><ul><ul><li>Any IP address starting with 129.97 belongs to UW </li></ul></ul><ul><ul><li>This gives UW 256<sup>2</sup> = 65536 IP addresses </li></ul></ul>
  9. 9. IP Addresses <ul><li>The translation of IP addresses to domain names is straightforward: </li></ul><ul><ul><li>Use an array of size 65536, indexed by the last two bytes of the address </li></ul></ul><ul><ul><li>E.g., 90 × 256 + 209 = 23249 </li></ul></ul><ul><li>Table (Index, Address, Domain Name): indices 23240 to 23249; one entry has no domain name </li></ul>
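The index arithmetic on this slide can be sketched as a direct-address table. This is a minimal illustration, not the slides' own code: the function name `uw_index`, the address `129.97.90.209` (chosen to match the 90 × 256 + 209 example), and the placeholder name `host.example` are all assumptions.

```python
def uw_index(address: str) -> int:
    """Map a dotted address 129.97.x.y to the index x * 256 + y (0..65535)."""
    octets = [int(part) for part in address.split(".")]
    if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not a valid IPv4 address: {address!r}")
    if octets[:2] != [129, 97]:
        raise ValueError(f"not a 129.97.x.y address: {address!r}")
    return octets[2] * 256 + octets[3]


# One slot per possible address; None means "no domain name".
domain_names = [None] * 65536

# "host.example" is a hypothetical placeholder, not a real UW host name.
domain_names[uw_index("129.97.90.209")] = "host.example"

print(uw_index("129.97.90.209"))
```

Both computing the index and reading the slot take constant time, which is why this approach works well when the table is dense.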
  10. 10. IP Addresses <ul><li>In this example, the solution is clear: </li></ul><ul><ul><li>The array is fixed in size </li></ul></ul><ul><ul><li>The array is almost filled (dense) </li></ul></ul><ul><ul><ul><li>UW currently uses 65% of possible IP addresses </li></ul></ul></ul><ul><ul><li>The translation from IP address to domain name is Θ(1) </li></ul></ul><ul><li>There are two problems we will examine: </li></ul><ul><ul><li>1. What if the array is very sparse? </li></ul></ul><ul><ul><li>2. How do we go the other way? </li></ul></ul>
  11. 11. IP Addresses <ul><li>Problem #1: </li></ul><ul><ul><li>UW uses 65% of its 2<sup>16</sup> IP addresses </li></ul></ul><ul><ul><li>What if the relative number of used addresses is small? </li></ul></ul><ul><li>The new standard, IPv6, uses 128-bit addresses </li></ul><ul><ul><li>Allows 340 undecillion IP addresses </li></ul></ul><ul><ul><li>~7 IP addresses per cubic micrometre of atmosphere </li></ul></ul><ul><ul><li>Removes the need for Network Address Translation (NAT) </li></ul></ul><ul><li>Suppose UW is assigned 2<sup>32</sup> addresses </li></ul><ul><ul><li>We cannot have an array with four billion entries </li></ul></ul>
  12. 12. IP Addresses <ul><li>An array storing (domain name, IP address)-pairs sorted on the IP address would be slow to maintain </li></ul><ul><ul><li>The IP address is the key and associated with it is the string </li></ul></ul><ul><ul><li>Any new or deleted domain name would require O(n) work </li></ul></ul><ul><ul><li>Accessing an entry would require a binary search: O(ln(n)) </li></ul></ul><ul><li>Even with an AVL tree, all operations would still require O(ln(n)) time </li></ul>
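The cost of the sorted-array approach can be seen in a short sketch. It assumes the keys are IP addresses already converted to integers; the names `keys`, `names`, `insert`, and `lookup` are illustrative only.

```python
import bisect

keys = []    # integer keys (e.g. IP addresses), kept in sorted order
names = []   # names[i] is the string associated with keys[i]


def insert(key: int, name: str) -> None:
    """Insert or update an entry, keeping the arrays sorted by key."""
    i = bisect.bisect_left(keys, key)      # binary search: O(ln(n))
    if i < len(keys) and keys[i] == key:
        names[i] = name                    # key already present: update
    else:
        keys.insert(i, key)                # shifts later entries: O(n)
        names.insert(i, name)


def lookup(key: int):
    """Return the name stored for key, or None if it is absent."""
    i = bisect.bisect_left(keys, key)      # binary search: O(ln(n))
    if i < len(keys) and keys[i] == key:
        return names[i]
    return None
```

Lookups are fast, but every insertion may shift a linear number of entries, which is the maintenance cost the slide refers to.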
  13. 13. IP Addresses <ul><li>Can we do better than O(ln(n))? </li></ul><ul><ul><li>Can we get it down to Θ(1)? </li></ul></ul><ul><li>Problem: </li></ul><ul><ul><li>So long as we require that the entries are sorted, we cannot do better than O(ln(n)) </li></ul></ul><ul><li>Do we care about the order? </li></ul><ul><ul><li>Do we need to know the IP address of the domain name which comes alphabetically after ? </li></ul></ul>
  14. 14. UW Student ID Numbers <ul><li>Let’s start with an easier example: </li></ul><ul><ul><li>Each UW student is assigned an 8-digit UW Student ID Number </li></ul></ul><ul><ul><li>Allocating an array of size 10<sup>8</sup> is wasteful </li></ul></ul><ul><ul><li>UW has only had ~10<sup>5</sup> students </li></ul></ul><ul><ul><li>There are only ~10<sup>2</sup> students in this class </li></ul></ul><ul><li>Suppose I want to store the grade associated with each student in this class </li></ul>
  15. 15. UW Student ID Numbers <ul><li>Solution: </li></ul><ul><ul><li>Allocate an array of 1000 bins </li></ul></ul><ul><ul><li>The bins are labeled 000, 001, ..., 999 </li></ul></ul><ul><ul><li>Store the mark of the student with number 20123456 in bin 456 </li></ul></ul><ul><li>Benefits: </li></ul><ul><ul><li>Taking the modulo 1000 is Θ(1) </li></ul></ul><ul><ul><li>Modulo n is the remainder after dividing by n </li></ul></ul><ul><ul><li>Accessing an array entry is also Θ(1) </li></ul></ul><ul><ul><li>Only 100 students: about 1 in 10 bins is filled </li></ul></ul><ul><li>Example (bins 454 to 465): bin 456 holds 84, bin 463 holds 79, and the other bins are empty </li></ul>
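The bin scheme on this slide can be sketched in a few lines. The names `NUM_BINS`, `bin_of`, `store_mark`, and `get_mark` are illustrative assumptions, and this version deliberately ignores collisions.

```python
NUM_BINS = 1000
marks = [None] * NUM_BINS        # one slot per bin 000..999


def bin_of(student_id: int) -> int:
    """Last three digits of the ID: the remainder after dividing by 1000."""
    return student_id % NUM_BINS


def store_mark(student_id: int, mark: int) -> None:
    # No collision handling: a second student with the same last
    # three digits would overwrite this entry.
    marks[bin_of(student_id)] = mark


def get_mark(student_id: int):
    return marks[bin_of(student_id)]


store_mark(20123456, 84)
print(bin_of(20123456), get_mark(20123456))
```

Note that any other ID ending in 456 would read back the same mark, which is exactly the weakness a full hash table has to address.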
  16. 16. UW Student ID Numbers <ul><li>Problem: </li></ul><ul><ul><li>Multiple students may have the same last three digits </li></ul></ul><ul><ul><li>Assuming the last three digits are random: </li></ul></ul><ul><ul><ul><li>What is the probability that all 100 students will have a different set of last three digits? </li></ul></ul></ul><ul><ul><li>Answer: about 0.6% </li></ul></ul><ul><li>Similar question: </li></ul><ul><ul><li>What is the probability that, in a group of 23 students, no two students share the same birthday? </li></ul></ul><ul><ul><li>Answer: 49% </li></ul></ul>
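These probabilities can be checked numerically. The sketch below assumes items are assigned to bins uniformly and independently; `prob_all_distinct` is an illustrative name.

```python
def prob_all_distinct(items: int, bins: int) -> float:
    """Probability that `items` uniformly random choices among `bins`
    equally likely values are all different."""
    p = 1.0
    for k in range(items):
        # The (k + 1)-th item must avoid the k values already taken.
        p *= (bins - k) / bins
    return p


print(f"100 students, 1000 bins: {prob_all_distinct(100, 1000):.2%}")
print(f"22 birthdays:            {prob_all_distinct(22, 365):.2%}")
print(f"23 birthdays:            {prob_all_distinct(23, 365):.2%}")
```

Under these assumptions the first value comes out at roughly 0.6%, and the birthday values at roughly 52% for 22 people and 49% for 23 people, so collisions are close to certain even when only one bin in ten is used.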
  17. 17. UW Student ID Numbers <ul><li>The process of mapping a number onto a smaller range is called hashing </li></ul><ul><li>When multiple objects hash to the same value, this is called a collision </li></ul><ul><li>Hash tables use a hash function together with a mechanism for dealing with collisions </li></ul>
  18. 18. IP Addresses <ul><li>Going back to our issue with UW being assigned 2<sup>32</sup> 128-bit IP addresses </li></ul><ul><ul><li>Assume we will use at most 2<sup>20</sup> IP addresses </li></ul></ul><ul><ul><li>Allocate an array of size 2<sup>20</sup> </li></ul></ul><ul><ul><li>Define a hash function which takes 128-bit inputs and maps them down to a number from 0, ..., 2<sup>20</sup> – 1 </li></ul></ul><ul><ul><li>Deal with collisions </li></ul></ul>
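One possible hash function for this step is sketched below. It uses XOR-folding, which is an assumption made for illustration and not a technique named on the slide; `hash_ipv6`, `TABLE_BITS`, and `TABLE_SIZE` are hypothetical names, and `2001:db8::/32` is the prefix reserved for documentation.

```python
import ipaddress

TABLE_BITS = 20
TABLE_SIZE = 1 << TABLE_BITS     # 2**20 array entries


def hash_ipv6(address: str) -> int:
    """Fold a 128-bit IPv6 address down to an index in 0..2**20 - 1."""
    value = int(ipaddress.IPv6Address(address))   # 128-bit integer
    h = 0
    while value:
        h ^= value & (TABLE_SIZE - 1)   # combine the low 20 bits
        value >>= TABLE_BITS            # move on to the next 20 bits
    return h


print(hash_ipv6("2001:db8::1"))
```

Because 2<sup>128</sup> possible inputs share 2<sup>20</sup> outputs, distinct addresses can produce the same index, so a collision mechanism is still required.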
  19. 19. IP Addresses <ul><li>Problem #2: </li></ul><ul><ul><li>How do we go the other way? </li></ul></ul><ul><li>For example, given , how do we find the corresponding IP address today? </li></ul><ul><li>Even with the 32-bit IP addresses of today, this is still a significant problem </li></ul><ul><li>Same idea: </li></ul><ul><ul><li>Take a hash of the string which maps it to a value on the range 0, ..., 2<sup>16</sup> – 1 </li></ul></ul><ul><ul><li>Deal with collisions and look it up in an array of size 2<sup>16</sup> </li></ul></ul>
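A string hash for this direction might look like the following. It is a simple polynomial hash with multiplier 31, chosen here as an assumption for illustration; the slides do not specify a particular string hash, and `hash_name` is a hypothetical name.

```python
def hash_name(name: str, table_size: int = 1 << 16) -> int:
    """Map a domain name to an index in 0..table_size - 1."""
    h = 0
    # Domain names are case-insensitive, so normalise before hashing.
    for byte in name.lower().encode("utf-8"):
        h = (h * 31 + byte) % table_size
    return h


print(hash_name("example.com"))
```

Every character contributes to the result, so names that differ in only one position usually land in different bins, though collisions remain possible.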
  20. 20. IP Addresses <ul><li>We will break the process into three independent steps: </li></ul><ul><ul><li>Object → 32-bit integer: techniques vary... </li></ul></ul><ul><ul><li>Map to an index 0, ..., M – 1: modulo, mid-square, multiplicative, Fibonacci </li></ul></ul><ul><ul><li>Deal with collisions: chained hash tables; open addressing (linear probing, double hashing) </li></ul></ul>
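The three steps can be put together in a small chained hash table. This is a sketch under stated assumptions: it uses Python's built-in `hash` for step 1, modulo for step 2, and chaining for step 3; the class name `ChainedHashTable` is illustrative.

```python
class ChainedHashTable:
    """Minimal hash table: each bin holds a list (chain) of (key, value) pairs."""

    def __init__(self, num_bins: int = 16):
        self.bins = [[] for _ in range(num_bins)]

    def _index(self, key) -> int:
        h32 = hash(key) & 0xFFFFFFFF     # step 1: object -> 32-bit integer
        return h32 % len(self.bins)      # step 2: map to an index 0..M-1

    def put(self, key, value) -> None:
        chain = self.bins[self._index(key)]
        # Step 3: colliding keys share a chain, searched linearly.
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[i] = (key, value)  # replace the existing entry
                return
        chain.append((key, value))

    def get(self, key, default=None):
        for existing_key, value in self.bins[self._index(key)]:
            if existing_key == key:
                return value
        return default
```

With a reasonable hash function and enough bins the chains stay short, so `put` and `get` run in constant time on average; the fixed bin count here keeps the sketch short and would need resizing in practice.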
  21. 21. Summary <ul><li>Discuss storing unordered data </li></ul><ul><li>Discuss IP addresses and domain names </li></ul><ul><li>Consider conversions between these two forms </li></ul><ul><li>Introduce the idea of using a smaller array </li></ul><ul><ul><li>Converted “large” numbers into valid array indices </li></ul></ul><ul><ul><li>Reduces O(ln(n)) in arrays and AVL trees to O(1) </li></ul></ul><ul><li>Discussed the issues with collisions </li></ul>
  22. 22. Usage Notes <ul><li>These slides are made publicly available on the web for anyone to use </li></ul><ul><li>If you choose to use them, or a part thereof, for a course at another institution, I ask only three things: </li></ul><ul><ul><li>that you inform me that you are using the slides, </li></ul></ul><ul><ul><li>that you acknowledge my work, and </li></ul></ul><ul><ul><li>that you alert me of any mistakes which I made or changes which you make, and allow me the option of incorporating such changes (with an acknowledgment) in my set of slides </li></ul></ul><ul><ul><li>Sincerely, </li></ul></ul><ul><ul><li>Douglas Wilhelm Harder, MMath </li></ul></ul><ul><ul><li>[email_address] </li></ul></ul>