6.01.hash tableintro
Transcript

  • 1. Introduction to Hash Tables
  • 2. Outline
    • Discuss storing unordered data
    • Discuss IP addresses and domain names
    • Consider conversions between these two forms
    • Introduce the idea of hashing:
      • Reducing O(ln(n)) operations to O(1)
    • Consider some of the weaknesses
  • 3. Background
    • Given a set of data, suppose each data item is associated with a unique integer key in a particular range:
      • UW Student ID Numbers decimal
      • Social Insurance Numbers decimal
      • IP Addresses binary
    • We will use IP addresses as a model
    • Databases often assign unique primary keys by using an automatically incremented counter
  • 4. IP Addresses
    • Each computer communicating on a network using the Internet Protocol Version 4 (IPv4) has a unique IP address
      • 32 bits/4 bytes allowing four billion addresses
    • Represented in human-readable format as four bytes, each written in decimal and separated by dots
      • The ECE web server has the IP address 129.97.56.100
      • The URL http://129.97.56.100/ is valid
      • IP addresses are not easy to remember
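
The dotted form and the underlying 32-bit value are two views of the same key. A minimal sketch of the conversion in both directions, assuming the four bytes are packed most-significant first; the function names `ipv4_to_int` and `int_to_ipv4` are illustrative, not taken from the slides:

```python
def ipv4_to_int(address: str) -> int:
    """Pack a dotted-decimal IPv4 address into a single 32-bit integer."""
    parts = address.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four bytes, got {address!r}")
    value = 0
    for part in parts:
        byte = int(part)
        if not 0 <= byte <= 255:
            raise ValueError(f"byte out of range: {part}")
        # Shift the bytes seen so far left by 8 bits and append the new one.
        value = (value << 8) | byte
    return value


def int_to_ipv4(value: int) -> str:
    """Unpack a 32-bit integer back into dotted-decimal form."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))
```

For example, `int_to_ipv4(ipv4_to_int("129.97.56.100"))` gives back `"129.97.56.100"`.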
  • 5. IP Addresses
    • Domain names were introduced for humans
    • The Domain Name System (DNS) is hierarchical
      • There are a limited number of top-level domains
      • Countries use ISO 3166 country codes
      • 2nd-, 3rd-, and lower-level domains are the responsibility of the parent domain
  • 6. IP Addresses
    • The Unix command host translates between IP addresses and domain names
    • $ host uwaterloo.ca
    • uwaterloo.ca has address 129.97.128.40
    • $ host ece.uwaterloo.ca
    • ece.uwaterloo.ca has address 129.97.56.100
    • $ host www.uwaterloo.ca
    • www.uwaterloo.ca is an alias for info.uwaterloo.ca.
    • info.uwaterloo.ca has address 129.97.128.40
    • $ host www.google.ca
    • www.google.ca is an alias for www.google.com.
    • www.google.com is an alias for www.l.google.com.
    • www.l.google.com has address 72.14.205.99
    • www.l.google.com has address 72.14.205.103
    • www.l.google.com has address 72.14.205.104
    • www.l.google.com has address 72.14.205.147
  • 7. IP Addresses
    • The mapping is not one-to-one
      • Some IP addresses have multiple domain names
      • Some domain names have multiple IP addresses
    • This allows flexibility:
  • 8. IP Addresses
    • DNS allows a division of effort in name translation
      • A DNS server in Korea ( kr ) does not need to know the IP address of uwaterloo.ca
    • Similarly, the University of Waterloo has control of names within its domain
      • Any IP address starting with 129.97 belongs to UW
      • This gives UW 256^2 = 65536 IP addresses
  • 9. IP Addresses
    • The translation of IP addresses to domain names is straightforward:
      • Use an array of size 65536
      • Use the last two bytes of the address as the index, e.g., 129.97.90.209 maps to 90 × 256 + 209 = 23249
    Index   Address         Domain Name
    23240   129.97.90.200   sidicsem.uwaterloo.ca
    23241   129.97.90.201   watdist8.uwaterloo.ca
    23242   129.97.90.202   NO DOMAIN NAME
    23243   129.97.90.203   secure0.uwaterloo.ca
    23244   129.97.90.204   msma.uwaterloo.ca
    23245   129.97.90.205   ehab0.uwaterloo.ca
    23246   129.97.90.206   calliope1.uwaterloo.ca
    23247   129.97.90.207   calliope2.uwaterloo.ca
    23248   129.97.90.208   dsip-lpt.uwaterloo.ca
    23249   129.97.90.209   churchill.uwaterloo.ca
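
The array lookup described on this slide can be sketched as follows. This is a minimal illustration, assuming every key has the form 129.97.x.y; the names `uw_index`, `names`, and `lookup` are made up for the example, and only two rows of the table are filled in:

```python
TABLE_SIZE = 256 * 256  # one slot for every address of the form 129.97.x.y


def uw_index(address: str) -> int:
    """Map 129.97.x.y to the array index x * 256 + y."""
    a, b, c, d = (int(part) for part in address.split("."))
    if (a, b) != (129, 97):
        raise ValueError("not a 129.97.x.y address")
    return c * 256 + d


# None plays the role of "NO DOMAIN NAME".
names = [None] * TABLE_SIZE
names[uw_index("129.97.90.203")] = "secure0.uwaterloo.ca"
names[uw_index("129.97.90.209")] = "churchill.uwaterloo.ca"


def lookup(address: str):
    """Constant-time translation: one index computation, one array access."""
    return names[uw_index(address)]
```

Here `lookup("129.97.90.209")` returns `"churchill.uwaterloo.ca"` and `lookup("129.97.90.202")` returns `None`.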
  • 10. IP Addresses
    • In this example, the solution is clear:
      • The array is fixed in size
      • The array is almost filled (dense)
        • UW currently uses 65% of possible IP addresses
      • The translation from IP address to domain name is Θ(1)
    • There are two problems we will examine:
      • 1. What if the array is very sparse?
      • 2. How do we go the other way?
  • 11. IP Addresses
    • Problem #1:
      • UW uses 65% of its 2^16 IP addresses
      • What if the relative number of used addresses is small?
    • The new standard, IPv6, uses 128-bit addresses
      • Allows 340 undecillion IP addresses
      • ~7 IP addresses per cubic micrometre of atmosphere
      • Removes the need for Network Address Translation (NAT)
    • Suppose UW is assigned 2^32 addresses
      • We cannot have an array with four billion entries
  • 12. IP Addresses
    • An array storing (domain name, IP address)-pairs sorted on the IP address would be slow to maintain
      • The IP address is the key and associated with it is the string
      • Inserting or deleting a domain name would require O(n) work
      • Accessing an entry would require a binary search: O(ln(n))
    • Even with an AVL tree, all operations would still require O(ln(n)) time
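
To make the costs on this slide concrete, here is a minimal sketch of the sorted-array approach using Python's standard `bisect` module. The two parallel lists and the function names `insert` and `find` are assumptions made for the illustration:

```python
import bisect

keys = []   # IP addresses as integers, always kept in sorted order
names = []  # names[i] is the domain name associated with keys[i]


def insert(key: int, name: str) -> None:
    """Insert or update an entry while keeping the keys sorted."""
    i = bisect.bisect_left(keys, key)      # binary search: O(ln(n))
    if i < len(keys) and keys[i] == key:
        names[i] = name                    # key already present: overwrite
    else:
        keys.insert(i, key)                # shifts every later entry: O(n)
        names.insert(i, name)


def find(key: int):
    """Look up a key by binary search; returns None if it is absent."""
    i = bisect.bisect_left(keys, key)      # O(ln(n))
    if i < len(keys) and keys[i] == key:
        return names[i]
    return None
```

The search is logarithmic, but each insertion may have to move a linear number of entries, which is the maintenance cost the slide refers to.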
  • 13. IP Addresses
    • Can we do better than O(ln(n))?
      • Can we get it down to Θ(1)?
    • Problem:
      • So long as we require that the entries are sorted, we cannot do better than O(ln(n))
    • Do we care about the order?
      • Do we need to know the IP address of the domain name which comes alphabetically after churchill.uwaterloo.ca ?
  • 14. UW Student ID Numbers
    • Let’s start with an easier example:
      • Each UW student is assigned an 8-digit UW Student ID Number
      • Allocating an array of size 10^8 is wasteful
      • UW has only had ~10^5 students
      • There are only ~10^2 students in this class
    • Suppose I want to store the grade associated with each student in this class
  • 15. UW Student ID Numbers
    • Solution:
      • Allocate an array of 1000 bins
      • The bins are labeled 000, 001, ..., 999
      • Store the mark of the student with number 20123456 in bin 456
    • Benefits:
      • Taking the modulo 1000 is Θ(1)
      • Modulo n is the remainder after dividing by n
      • Accessing an array entry is also Θ(1)
      • Only 100 students: 1 in 10 bins are filled
    Bin   Mark
    ...   ...
    454
    455
    456   84
    457
    458
    459
    460
    461
    462
    463   79
    464
    465
    ...   ...
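
The bin scheme above can be sketched in a few lines. This is only an illustration: the names `bin_of`, `store`, and `fetch` are made up, and the student number 20654463 is a hypothetical ID chosen so that it lands in bin 463:

```python
NUM_BINS = 1000
marks = [None] * NUM_BINS  # bins 000, 001, ..., 999; None means empty


def bin_of(student_id: int) -> int:
    """The last three decimal digits of the ID: student_id modulo 1000."""
    return student_id % NUM_BINS


def store(student_id: int, mark: int) -> None:
    marks[bin_of(student_id)] = mark


def fetch(student_id: int):
    return marks[bin_of(student_id)]


store(20123456, 84)  # bin 456
store(20654463, 79)  # bin 463 (hypothetical student number)
```

Note that this sketch silently overwrites a bin when two IDs share their last three digits: a hypothetical student 20999456 would also land in bin 456.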
  • 16. UW Student ID Numbers
    • Problem:
      • Multiple students may have the same last three digits
      • Assuming the last three digits are random:
        • What is the probability that all students will have a different set of last three digits?
      • Answer: roughly 0.6% for exactly 100 students
    • Similar question:
      • What is the probability that, in a group of 23 students, no two students share the same birthday?
      • Answer: 49%
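
Both questions on this slide are instances of the same product: the first item can go anywhere, the second must avoid one slot, the third must avoid two, and so on. A minimal sketch, assuming every slot is equally likely and independent; the function name `prob_all_distinct` is illustrative:

```python
def prob_all_distinct(items: int, slots: int) -> float:
    """Probability that `items` uniformly random choices from `slots`
    possibilities are all different: the product of (slots - i) / slots
    for i = 0, ..., items - 1."""
    p = 1.0
    for i in range(items):
        p *= (slots - i) / slots
    return p
```

Calling `prob_all_distinct(n, 1000)` answers the student-number question for a class of `n` students, and `prob_all_distinct(n, 365)` answers the birthday question (ignoring leap years).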
  • 17. UW Student ID Numbers
    • The process of mapping a number onto a smaller range is called hashing
    • When multiple objects hash to the same value, this is called a collision
    • Hash tables use a hash function together with a mechanism for dealing with collisions
  • 18. IP Addresses
    • Going back to our issue with UW being assigned 2^32 128-bit IP addresses
      • Assume we will use at most 2^20 IP addresses
      • Allocate an array of size 2^20
      • Define a hash function which takes a 128-bit input and maps it down to a number in 0, ..., 2^20 – 1
      • Deal with collisions
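
One possible hash function for this step is sketched below. The slides do not say which function to use; XOR-folding the address in 20-bit chunks is an assumption chosen here so that every bit of the input influences the result. The name `hash_ipv6` is illustrative, and parsing relies on Python's standard `ipaddress` module:

```python
import ipaddress

TABLE_BITS = 20
TABLE_SIZE = 1 << TABLE_BITS  # 2^20 array entries


def hash_ipv6(address: str) -> int:
    """Fold a 128-bit IPv6 address down to an index in 0, ..., 2^20 - 1."""
    value = int(ipaddress.IPv6Address(address))  # the 128-bit key
    index = 0
    while value:
        index ^= value & (TABLE_SIZE - 1)  # take the low 20 bits
        value >>= TABLE_BITS               # move on to the next chunk
    return index
```

Distinct addresses can still fold to the same index, so the last bullet (dealing with collisions) remains necessary.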
  • 19. IP Addresses
    • Problem #2:
      • How do we go the other way?
    • For example, given churchill.uwaterloo.ca , how do we find the corresponding IP address 129.97.90.209?
    • Even with the 32-bit IP address of today, this is still a significant problem
    • Same idea:
      • Take a hash of the string which maps it to a value on the range 0, ..., 2^16 – 1
      • Deal with collisions and look it up in an array of size 2^16
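
A minimal sketch of hashing a domain name onto 0, ..., 2^16 – 1. The polynomial scheme and the multiplier 31 are assumptions made for the illustration, not the method prescribed by the slides; the name is lower-cased first because domain names are not case-sensitive:

```python
TABLE_SIZE = 1 << 16  # 2^16 array entries


def hash_name(name: str) -> int:
    """Map a domain name to an index in 0, ..., 2^16 - 1."""
    h = 0
    for byte in name.lower().encode("utf-8"):
        # Polynomial accumulation, reduced modulo the table size at each step.
        h = (h * 31 + byte) % TABLE_SIZE
    return h
```

The index `hash_name("churchill.uwaterloo.ca")` would then select the array entry in which to look for the address, subject to whatever collision handling is used.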
  • 20. IP Addresses
    • We will break the process into three independent steps:
    • Object → 32-bit integer
      • Techniques vary...
    • 32-bit integer → map to an index 0, ..., M – 1
      • Modulo, mid-square, multiplicative, Fibonacci
    • Deal with collisions
      • Chained hash tables
      • Open addressing
        • Linear probing
        • Double hashing
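
The three steps can be seen together in a small chained hash table. This is a sketch under stated assumptions rather than a reference implementation: Python's built-in `hash` stands in for the object-to-integer step, the modulo method is used for the index, and collisions are handled by chaining. The class name `ChainedHashTable` and its methods are illustrative:

```python
class ChainedHashTable:
    def __init__(self, num_bins: int = 16):
        # One chain (a list of key-value pairs) per bin.
        self.bins = [[] for _ in range(num_bins)]

    def _index(self, key) -> int:
        h32 = hash(key) & 0xFFFFFFFF   # step 1: object -> 32-bit integer
        return h32 % len(self.bins)    # step 2: integer -> index 0, ..., M - 1

    def put(self, key, value) -> None:
        chain = self.bins[self._index(key)]
        for i, (existing, _) in enumerate(chain):
            if existing == key:
                chain[i] = (key, value)  # key already stored: overwrite
                return
        chain.append((key, value))       # step 3: colliding keys share a chain

    def get(self, key, default=None):
        for existing, value in self.bins[self._index(key)]:
            if existing == key:
                return value
        return default
```

With only a few bins, many keys share a chain and lookups degrade toward a linear scan, so the number of bins would normally be kept proportional to the number of stored entries.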
  • 21. Summary
    • Discuss storing unordered data
    • Discuss IP addresses and domain names
    • Consider conversions between these two forms
    • Introduce the idea of using a smaller array
      • Convert “large” numbers into valid array indices
      • Reduces the O(ln(n)) of sorted arrays and AVL trees to O(1)
    • Discussed the issues with collisions
  • 22. Usage Notes
    • These slides are made publicly available on the web for anyone to use
    • If you choose to use them, or a part thereof, for a course at another institution, I ask only three things:
      • that you inform me that you are using the slides,
      • that you acknowledge my work, and
      • that you alert me of any mistakes which I made or changes which you make, and allow me the option of incorporating such changes (with an acknowledgment) in my set of slides
      • Sincerely,
      • Douglas Wilhelm Harder, MMath
      • [email_address]