Ds Ch2


Published in: Technology, Education

  1. Chapter 2. Searching (Klsue/NCU, 2009/10/1). Outline: List Searches; C++ Search Algorithms; Hashed List Searches; Collision Resolution.
     List Searches, Basic Search Concept: The algorithm used to search a list depends on the structure of the list. The two basic searches for arrays are the sequential search and the binary search. The basic search concept is shown in Figure 2-1 (Locating data in unordered list).
  2. List Searches, Sequential Search: Used whenever the list is not ordered. It suits small lists and lists that are not searched often. A search has two possible outcomes: we find the target or we reach the end of the list (Figure 2-2, Search concept; Figure 2-3, Unsuccessful search in unordered list).
     Sequential Search Algorithm: The algorithm must tell the calling algorithm two things: whether it found the target data and, if so, at what index the target was found. It requires four parameters: the list, the index of the last element, the target, and the address in which the location of the target is stored.
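The four-parameter interface described above can be sketched in C++ as follows. This is a minimal illustration, not the slide deck's Algorithm 2-1: the name `seq_search`, the use of `std::vector<int>` (which carries the index of the last element implicitly), and the convention of setting `locn` to -1 for an empty list are all assumptions.

```cpp
#include <vector>

// Sequential search over an unordered list.
// Returns true when target is found. locn receives the index of the match,
// or the index of the last element examined when the search fails.
// Assumption: an empty list sets locn to -1.
bool seq_search(const std::vector<int>& list, int target, int& locn) {
    if (list.empty()) {
        locn = -1;
        return false;
    }
    const int last = static_cast<int>(list.size()) - 1;
    int looker = 0;
    // Two conditions are tested on every pass: end of list, target not found.
    while (looker < last && list[looker] != target) {
        ++looker;
    }
    locn = looker;
    return list[looker] == target;
}
```

The return value answers "did it find the target?" and the reference parameter answers "at what index?", matching the two things the calling algorithm needs to know.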
  3. Sequential Search Algorithm: Algorithm 2-1, Sequential search.
     Variations on Sequential Searches: Three useful variations are the sentinel search, the probability search, and the ordered list search. Observation: the loop tests two conditions, end of list and target not found. Knuth states: “When the inner loop of a program tests two or more conditions, an attempt should be made to reduce it to just one condition.”
  4. Variations on Sequential Searches, Sentinel Search: The target is guaranteed to be in the list because a sentinel entry is added at the end of the array and the target is placed in the sentinel. The loop then needs to test only for the target. Pseudocode for the sentinel search is shown in Algorithm 2-2 (Sentinel search).
     Probability Search: The array is ordered with the most probable search elements at the beginning and the least probable at the end. It is especially useful when a few elements are the targets of most of the searches.
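A hedged C++ sketch of the sentinel idea follows. It is not the deck's Algorithm 2-2: the name `sentinel_search`, the temporary `push_back`/`pop_back` of the sentinel, and the choice to report `locn` as one past the last element on failure are assumptions made for illustration.

```cpp
#include <vector>

// Sentinel search: append the target as a sentinel so the loop needs
// only one test (target not found) instead of two.
// Returns true when the target was found before the sentinel.
// Assumption: on failure locn is the sentinel's index (one past the last element).
bool sentinel_search(std::vector<int>& list, int target, int& locn) {
    const int count = static_cast<int>(list.size());
    list.push_back(target);            // place the target in the sentinel entry
    int looker = 0;
    while (list[looker] != target) {   // single condition; the sentinel stops the loop
        ++looker;
    }
    list.pop_back();                   // restore the caller's list
    locn = looker;
    return looker < count;
}
```

This reduces the inner loop to a single condition, which is the point of the Knuth quotation in the previous item.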
  5. Probability Search: To keep the probability ordering correct over time, each search exchanges the located element with the element before it. A typical implementation of the probability search is shown in Algorithm 2-3 (Probability search).
     Variations on Sequential Searches, Ordered List Search: When searching a small list, an ordered sequential search may be more efficient than a binary search. It is not necessary to search to the end of the list: stop searching when the target becomes less than or equal to the current element.
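The exchange step of the probability search can be sketched in C++ as below. The function name `probability_search` and the -1 value stored in `locn` on failure are assumptions; the deck's Algorithm 2-3 is not reproduced in the transcript.

```cpp
#include <utility>
#include <vector>

// Probability search: a sequential search that moves each located element
// one position toward the front, so frequently requested items drift forward.
// Returns true when found; locn receives the element's index after the exchange.
// Assumption: locn is set to -1 when the target is absent.
bool probability_search(std::vector<int>& list, int target, int& locn) {
    const int count = static_cast<int>(list.size());
    int looker = 0;
    while (looker < count && list[looker] != target) {
        ++looker;
    }
    if (looker == count) {
        locn = -1;
        return false;
    }
    if (looker > 0) {
        std::swap(list[looker], list[looker - 1]);  // exchange with the element before it
        --looker;
    }
    locn = looker;
    return true;
}
```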
  6. Ordered List Search: Algorithm 2-4, Ordered list search.
     Binary Search: A more efficient algorithm to use whenever the list starts to become large. Suggestion: consider a binary search whenever the list contains more than 16 elements. Idea: testing the data in the element at the middle of the list determines whether the target is in the first or the second half of the list. Half of the list is eliminated on each step until we find the target or determine that it is not in the list. Three variables are required: the beginning of the list, the middle of the list, and the end of the list.
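As an illustration of the ordered list search named at the start of the item above, here is a minimal C++ sketch. It assumes the list is sorted in ascending order; the name `ordered_search` and the choice to leave `locn` at the position where the search stopped are assumptions.

```cpp
#include <vector>

// Ordered list search: sequential search over an ascending list that stops
// as soon as the target is less than or equal to the current element.
// Returns true when found. locn is the index where the search stopped,
// which on failure is where the target would have to be inserted.
bool ordered_search(const std::vector<int>& list, int target, int& locn) {
    const int count = static_cast<int>(list.size());
    int looker = 0;
    while (looker < count && target > list[looker]) {
        ++looker;
    }
    locn = looker;
    return looker < count && list[looker] == target;
}
```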
  7. Target Found: Figure 2-4 (Binary search example) shows how we find 22. The middle index is calculated from first and last, typically as mid = (first + last) / 2 with integer division. Two conditions terminate the binary search: the target is found, or first becomes larger than last.
     Target Not Found: When the target is not in the list, the binary search algorithm stops by testing for first and last crossing (Figure 2-5, Unsuccessful binary search example).
  8. Binary Search Algorithm: Four parameters describe the search: the list, the index of the last element, the target we are looking for, and the address into which we place the located index. The pseudocode is Algorithm 2-5 (Binary search).
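The binary search described in the last few items can be sketched in C++ as follows. This is an illustrative version, not the deck's Algorithm 2-5: the name `binary_search_list` is hypothetical, the list is assumed to be sorted in ascending order, and `mid` is computed as `first + (last - first) / 2`, which gives the same value as `(first + last) / 2` while avoiding integer overflow on very large indexes.

```cpp
#include <vector>

// Binary search over an ascending list.
// Returns true when found; locn receives the last middle index examined.
bool binary_search_list(const std::vector<int>& list, int target, int& locn) {
    int first = 0;
    int last = static_cast<int>(list.size()) - 1;
    int mid = 0;
    while (first <= last) {                  // stop when first and last cross
        mid = first + (last - first) / 2;
        if (target > list[mid]) {
            first = mid + 1;                 // target is in the upper half
        } else if (target < list[mid]) {
            last = mid - 1;                  // target is in the lower half
        } else {
            break;                           // target found
        }
    }
    locn = mid;
    return !list.empty() && list[mid] == target;
}
```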
  9. Analyzing Search Algorithms: Which of the five search algorithms discussed is the best? The application often determines which algorithm should be used.
     Sequential Search: The basic loop of the sequential search makes the algorithm linear; its efficiency is O(n).
     Binary Search: Locates an item by repeatedly dividing the list in half.
  10. Analyzing Search Algorithms: The loop for the binary search is a logarithmic loop, so its efficiency is O(log n). Comparison: disregarding the time required to order the list, the binary search is clearly more efficient for searching a list of any significant size, as shown in Table 2-1 (Comparison of binary and sequential searches).
     C++ Search Algorithms: Sequential Search in C++ is shown in Program 2-1. Program 2-1 Analysis: Why use a while loop rather than a for loop? Finding something is an event, so we use an event loop. Call-by-reference is used for locn to pass back the matching location.
  11. C++ Search Algorithms: Binary Search in C++ is shown in Program 2-2. Program 2-2 Analysis: a break statement is used to force the while loop to end; alternatively, the loop can be forced to end with an assignment that makes the loop condition false.
     Hashed List Searches: The goal of a hashed search is to find the data with only one test. Basic concepts: the key determines the location of the data, and a hashing algorithm transforms the key into an index. Another way to describe hashing: a key-to-address transformation in which the keys map to addresses in a list, as shown in Figure 2-6.
  12. Basic Concept of Hash Search (Figure 2-6, The hash concept): Synonyms are the set of keys that hash to the same location. A collision occurs when the hashing algorithm produces an address for an insertion key and that address is already occupied.
     Collision resolution concept: The home address is the address produced by the hashing algorithm. The prime area is the memory that contains all of the home addresses. When two keys collide at a home address, the collision is resolved by placing the later data in another location. The collision resolution concept is shown in Figure 2-7.
  13. Basic Concept of Hash Search, locating an element: Use the same algorithm that was used to insert it. Hash the key and check the home address; if it does not contain the desired element, use the collision resolution algorithm to determine the next location. A probe is each calculation of an address and test for success.
     Hashing Method: The basic hashing techniques are shown in Figure 2-8.
     Direct Method: In direct hashing the key is the address, without any algorithmic manipulation. The data structure must contain an element for every possible key. Application: use the day of the month as the key and add each sale amount to the corresponding accumulator. Evaluation: the situations in which the direct method can be used are limited, but it is very powerful because it guarantees no synonyms and no collisions.
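The daily-sales application of direct hashing can be sketched in C++ like this. The record type `Sale`, the alias `DailyTotals`, and the function `accumulate_sale` are hypothetical names; the array is sized 32 so that days 1 to 31 can be used directly as indexes, leaving element 0 unused.

```cpp
#include <array>

// Hypothetical sale record: the day of the month is the key.
struct Sale {
    int day;        // 1..31
    double amount;
};

// One accumulator for every possible key; index 0 is unused.
using DailyTotals = std::array<double, 32>;

// Direct hashing: the key is the address, with no algorithmic manipulation.
void accumulate_sale(DailyTotals& totals, const Sale& sale) {
    totals[sale.day] += sale.amount;
}
```

Because every possible key has its own element, no two keys can ever map to the same address.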
  14. Direct hashing, a more complex example: A company has fewer than 100 employees. Let the employee numbers be between 1 and 100. The employee number can be used directly as the address of any individual record. The concept is shown in Figure 2-9 (Direct hashing of employee numbers).
     Subtraction Method: The keys are consecutive but do not start from 1. For example, the employee numbers are 1001~1100 (the company has 100 employees), and a hashing function subtracts 1000 from the key to determine the address. It can be used only for small lists in which the keys map to a densely filled list.
     Modulo-Division Method: Also known as division remainder. It divides the key by the array size and uses the remainder for the address. This algorithm works with any list size, but a prime number is better for the list size because it produces fewer collisions than other list sizes.
  15. Modulo-Division Method, application: A new employee numbering system will handle 1,000,000 employees, and we provide data space for up to 300 employees. The first prime number greater than 300 is 307, so 307 is used as the list size. Example: Brian Devaux’s employee number, 121267 (Figure 2-10, Modulo-division hashing).
     Digit-Extraction Method: Selected digits are extracted from the key and used as the address, for example to reduce a 6-digit employee number to a 3-digit address.
     Midsquare Method: In midsquare hashing the key is squared and the address is selected from the middle of the squared number. The most obvious limitation of this method is the size of the key. Example: the midsquare address calculation of 9452.
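The two calculations mentioned above can be worked through with a short C++ sketch. The function names are hypothetical. Two assumptions are made: the modulo-division address is taken to be the bare remainder (some presentations add 1 so that addresses start at 1), and the midsquare address is taken to be the centered group of digits of the square.

```cpp
#include <cstdint>

// Modulo-division: the remainder of key / list_size is the address.
// Assumption: the bare remainder is used, giving addresses 0..list_size-1.
std::uint64_t modulo_division_hash(std::uint64_t key, std::uint64_t list_size) {
    return key % list_size;
}

// Midsquare: square the key and take address_digits digits from the middle.
// Assumption: the extracted digits are centered in the squared number.
std::uint64_t midsquare_hash(std::uint64_t key, int address_digits) {
    std::uint64_t square = key * key;
    int digits = 0;
    for (std::uint64_t s = square; s > 0; s /= 10) {
        ++digits;                               // count the digits of the square
    }
    const int drop = (digits - address_digits) / 2;
    for (int i = 0; i < drop; ++i) {
        square /= 10;                           // discard low-order digits
    }
    std::uint64_t modulus = 1;
    for (int i = 0; i < address_digits; ++i) {
        modulus *= 10;
    }
    return square % modulus;                    // discard high-order digits
}
```

With these assumptions, 121267 divided by 307 is 395 with remainder 2, so the key maps to address 2; and 9452 squared is 89340304, whose middle four digits are 3403.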
  16. Midsquare Method, variation: Select a portion of the key and use it in place of the whole key. The same digits must be selected from the product for every key.
     Folding Method: There are two folding methods, fold shift and fold boundary. In fold shift, the key value is divided into parts whose size matches the size of the required address; the left and right parts are shifted and added to the middle part, as shown in Figure 2-11(a). In fold boundary, the left and right numbers are folded on the boundary between them and the center number, as shown in Figure 2-11(b). The two folding methods give different hashed addresses (Figure 2-11, Hash fold examples).
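A hedged C++ sketch of the two folding methods follows. Several details are assumptions rather than content from the slides: keys are passed as digit strings, "folding on the boundary" is interpreted as reversing the digits of the leftmost and rightmost parts, any carry beyond the address width is discarded, and the key 123456789 in the usage note is an illustrative value.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Returns 10^digits; used to discard any carry beyond the address width.
unsigned long power_of_ten(std::size_t digits) {
    unsigned long result = 1;
    for (std::size_t i = 0; i < digits; ++i) {
        result *= 10;
    }
    return result;
}

// Fold shift: split the key into parts of part_digits digits and add them.
unsigned long fold_shift(const std::string& key, std::size_t part_digits) {
    unsigned long sum = 0;
    for (std::size_t i = 0; i < key.size(); i += part_digits) {
        sum += std::stoul(key.substr(i, part_digits));
    }
    return sum % power_of_ten(part_digits);
}

// Fold boundary: as fold shift, but the leftmost and rightmost parts are
// reversed (folded over) before the parts are added.
unsigned long fold_boundary(const std::string& key, std::size_t part_digits) {
    const std::size_t parts = (key.size() + part_digits - 1) / part_digits;
    unsigned long sum = 0;
    for (std::size_t p = 0; p < parts; ++p) {
        std::string part = key.substr(p * part_digits, part_digits);
        if (p == 0 || p == parts - 1) {
            std::reverse(part.begin(), part.end());
        }
        sum += std::stoul(part);
    }
    return sum % power_of_ten(part_digits);
}
```

For the illustrative key 123456789 and a three-digit address, fold shift adds 123 + 456 + 789 = 1368 and keeps 368, while fold boundary adds 321 + 456 + 987 = 1764 and keeps 764, which shows that the two methods give different addresses.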
  17. Rotation Method: Rotation hashing usually works together with other hashing methods and is most useful when keys are serial. Example: the hashing keys are identical except for the last character. Resolution: rotate the last character to the front of the key (Figure 2-12, Rotation hashing). In Figure 2-12, all keys now end in 60010, so modulo division would not work well. Rotation is therefore often used in combination with folding or pseudorandom hashing. Applying fold shift to obtain a two-digit address, we get 26, 36, 46, 56, and 66, which spreads the data more evenly.
     Pseudorandom Method: The key is used as the seed of a pseudorandom number generator, and the resulting random number is scaled into the possible address range by using modulo division. A common random number generator is y = ax + c, where x is the key.
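The rotation example above can be reproduced with a small C++ sketch. The serial keys 600101 through 600105 are inferred from the statement that all rotated keys end in 60010 and from the resulting addresses 26 to 66; they, the function names, and the string representation of keys are assumptions.

```cpp
#include <cstddef>
#include <string>

// Rotation: move the last character of the key to the front.
std::string rotate_key(const std::string& key) {
    if (key.size() < 2) {
        return key;
    }
    return key.back() + key.substr(0, key.size() - 1);
}

// Fold shift to a two-digit address: add the two-digit parts of the key
// and discard any carry beyond two digits.
unsigned long fold_shift_pairs(const std::string& key) {
    unsigned long sum = 0;
    for (std::size_t i = 0; i < key.size(); i += 2) {
        sum += std::stoul(key.substr(i, 2));
    }
    return sum % 100;
}
```

Under these assumptions, key 600101 rotates to 160010 and folds to 16 + 00 + 10 = 26, and the next four serial keys give 36, 46, 56, and 66.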
  18. Pseudorandom Method: The result is divided by the list size, with the remainder plus 1 being the hashed address. Example, using the key from Figure 2-10: use a = 17 and c = 7. For maximum efficiency, the factors a and c should be prime numbers.
     Hashing Algorithm: The hashing methods above may work well for hashing a key to an address, but hashing a large file is more complex and requires analysis of the population of keys to determine the number of synonyms. An example is Algorithm 2-6 (A hashing algorithm): an alphanumeric key of up to 30 bytes must be hashed into a 32-bit address. Steps: convert the key into a number by using the ASCII values of its characters, accumulate the values, rotate the bits in the address, and take the absolute value of the address.
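The pseudorandom calculation can be sketched in C++ as below, assuming the generator has the linear form y = a * key + c and following the statement above that the remainder plus 1 is the hashed address. The function name and parameter order are hypothetical.

```cpp
#include <cstdint>

// Pseudorandom hashing: the key seeds a linear generator y = a * key + c,
// and modulo division scales the result into the address range.
// Following the text, the remainder plus 1 is the address (1..list_size).
std::uint64_t pseudorandom_hash(std::uint64_t key, std::uint64_t a,
                                std::uint64_t c, std::uint64_t list_size) {
    const std::uint64_t y = a * key + c;
    return y % list_size + 1;
}
```

For key 121267 with a = 17, c = 7, and a list size of 307: 17 * 121267 + 7 = 2061546, the remainder after dividing by 307 is 41, and adding 1 gives address 42 under this convention.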
  19. Hashing Algorithm: Algorithm 2-6, A hashing algorithm. Algorithm 2-6 Analysis: the rotation in statement 3.1.2 can be performed by a single assembly language instruction. The algorithm uses three of the hashing methods: fold shift, rotation, and modulo division.
     Collision Resolution: No hashing method is a one-to-one mapping except the direct and subtraction methods. There are several collision resolution methods, shown in Figure 2-13; they are independent of the hashing algorithm.
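The steps listed for the hashing algorithm (accumulate ASCII values, rotate the bits, take the absolute value, apply modulo division) can be sketched in C++ as follows. This is not Algorithm 2-6 itself, whose pseudocode is not reproduced in the transcript: the function name, the rotation amount `kRotate`, and the point at which the rotation is applied are all assumptions.

```cpp
#include <cstdint>
#include <string>

// Hash an alphanumeric key into an address in the range 0..list_size-1.
// Assumption: the address is rotated left by kRotate bits after each
// character is accumulated; the real algorithm may rotate differently.
std::uint32_t hash_key(const std::string& key, std::uint32_t list_size) {
    constexpr unsigned kRotate = 12;           // hypothetical rotation amount
    std::uint32_t addr = 0;
    for (unsigned char ch : key) {
        addr += ch;                            // accumulate the ASCII value
        addr = (addr << kRotate) | (addr >> (32 - kRotate));  // rotate bits left
    }
    // Interpret the bits as a signed 32-bit value and take its absolute
    // value in 64 bits so the most negative value cannot overflow.
    std::int64_t value = static_cast<std::int32_t>(addr);
    if (value < 0) {
        value = -value;
    }
    return static_cast<std::uint32_t>(value % list_size);  // modulo division
}
```

The rotation is written with shifts so that compilers can map it to a single rotate instruction, in line with the remark about statement 3.1.2.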
  20. Collision Resolution (Figure 2-13, Collision resolution methods), three concepts.
     Load factor of a hashed list: load factor = k / n, usually expressed as a percentage, in which k is the number of filled elements and n is the total number of elements.
     Clustering: the tendency of data to build up unevenly across a hashed list. If the list contains a high degree of clustering, the number of probes needed to locate an element grows, which reduces the processing efficiency of the list. Goal: design the hashing algorithm to minimize clustering.
     Two types of clustering: Primary clustering occurs when data cluster around a home address; it is easy to identify. Secondary clustering occurs when data become grouped along a collision path throughout a list; it is not easy to identify.
  21. Collision Resolution, final concept: The number of elements examined for a place to store the data must be limited. The traditional limit is to examine all elements, which has three problems: the search is not sequential, so finding the end does not mean every element has been tested; examining every element would be time-consuming; and some of the collision resolution techniques cannot physically examine all of the elements.
     Open Addressing: Resolves collisions in the prime area, the area that contains all of the home addresses. There are four different methods: linear probe, quadratic probe, double hashing, and key offset.
     Linear Probe: If data cannot be stored in the home address, the collision is resolved by adding 1 to the current address. Two advantages: it is quite simple to implement, and data tend to remain near their home address. Disadvantage: it tends to produce primary clustering (Figure 2-14, Linear probe collision resolution).
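A minimal C++ sketch of linear probe insertion and lookup follows. The names, the use of -1 as an empty-slot marker (`kEmpty`), the restriction to nonnegative keys, modulo division as the hash, and wrapping from the last address back to 0 are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

constexpr int kEmpty = -1;  // hypothetical marker for an unused slot

// Insert a nonnegative key; returns the address used, or -1 if the table is full.
int linear_probe_insert(std::vector<int>& table, int key) {
    const std::size_t n = table.size();
    if (n == 0) {
        return -1;
    }
    std::size_t addr = static_cast<std::size_t>(key) % n;   // home address
    for (std::size_t probes = 0; probes < n; ++probes) {    // limit the probes
        if (table[addr] == kEmpty) {
            table[addr] = key;
            return static_cast<int>(addr);
        }
        addr = (addr + 1) % n;   // collision: add 1 to the current address
    }
    return -1;
}

// Find a key by following the same probe path; returns its address or -1.
int linear_probe_find(const std::vector<int>& table, int key) {
    const std::size_t n = table.size();
    if (n == 0) {
        return -1;
    }
    std::size_t addr = static_cast<std::size_t>(key) % n;
    for (std::size_t probes = 0; probes < n; ++probes) {
        if (table[addr] == key) {
            return static_cast<int>(addr);
        }
        if (table[addr] == kEmpty) {
            return -1;           // an empty slot ends the collision path
        }
        addr = (addr + 1) % n;
    }
    return -1;
}
```

Inserting the synonyms 8, 15, and 22 into a 7-element table places them at addresses 1, 2, and 3, which illustrates how data stay near the home address and how primary clusters form.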
  22. Open Addressing, Quadratic Probe: The increment is the square of the collision probe number. A potential disadvantage is the time needed to compute the square. Limitation: it cannot generate a new address for every element. If the list size is 100 (see Table 2-2, Quadratic collision resolution increments), only 59 of the probes generate unique addresses; the other 41 locations will not be probed.
     Open Addressing, Pseudorandom Collision Resolution: Uses the collision address as a factor in the random number calculation (Figure 2-15, Pseudorandom collision resolution). Significant limitation: all keys follow only one collision resolution path.
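The quadratic probe increments can be illustrated with the C++ sketch below. It assumes that each new address is the previous collision address plus the square of the probe number, wrapped with modulo division by the list size; the function name is hypothetical and the slides' Table 2-2 is not reproduced here.

```cpp
#include <cstddef>
#include <vector>

// Generate the sequence of addresses tried by a quadratic probe.
// Assumption: address_k = (address_{k-1} + k * k) % list_size,
// where k is the collision probe number starting at 1.
std::vector<std::size_t> quadratic_probe_path(std::size_t collision_addr,
                                              std::size_t list_size,
                                              std::size_t probes) {
    std::vector<std::size_t> path;
    std::size_t addr = collision_addr;
    for (std::size_t k = 1; k <= probes; ++k) {
        addr = (addr + k * k) % list_size;   // increment is the probe number squared
        path.push_back(addr);
    }
    return path;
}
```

Starting from a collision at address 1 in a list of size 100, the first four probes under this assumption visit addresses 2, 6, 15, and 31. Because the increments grow quickly and wrap around, some addresses are never reached, which is the limitation noted above.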
  23. Open Addressing, Key Offset: A double hashing method that produces different collision paths for different keys. One of the simplest versions simply adds the quotient of the key divided by the list size to the address. For example, the key is 070918 and the list size is 307; the modulo-division hashing method generates an address of 1, as shown in Figure 2-15. Key 166702 then produces a collision at address 1, and key offset is used to create the next address. To really see the effect of key offset, we need to calculate several different keys, all of which hash to the same home address, 1 (Table 2-3, Key-offset examples).
     Linked List Resolution: A major disadvantage of open addressing is that each collision resolution increases the probability of future collisions. A linked list is an ordered collection of data in which each element contains the location of the next element.
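The key offset calculation can be sketched in C++ as follows. The function name is hypothetical, and wrapping the result with modulo division by the list size is an assumption needed to keep the new address inside the list.

```cpp
#include <cstdint>

// Key offset: add the quotient of key / list_size to the collision address.
// Assumption: the sum is wrapped into the list with modulo division.
std::uint64_t key_offset_next(std::uint64_t key, std::uint64_t address,
                              std::uint64_t list_size) {
    const std::uint64_t offset = key / list_size;   // integer quotient
    return (offset + address) % list_size;
}
```

For key 166702 colliding at address 1 with list size 307, the quotient is 543, so the next address under this assumption is (543 + 1) modulo 307 = 237. Key 070918, which has the same home address, has quotient 231 and would move to 232, so the two synonyms follow different collision paths.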
  24. Linked List Resolution: Uses two storage areas, the prime area and the overflow area. Linked list data can be stored in any order. The most common orders are a last in-first out (LIFO) sequence, in which the element is placed at the beginning of the overflow list, and a key sequence (Figure 2-16, Linked list collision resolution).
     Bucket Hashing: Because a bucket can hold multiple data, collisions are postponed until the bucket is full (Figure 2-17, Bucket hashing). Two problems: it uses significantly more space, and it does not completely resolve the collision problem.
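Linked list resolution with LIFO overflow order can be sketched in C++ as below. The class `ChainedTable`, its member names, the use of `std::forward_list` for the overflow area, and modulo division as the hash are assumptions; the prime area size is assumed to be greater than zero.

```cpp
#include <cstddef>
#include <forward_list>
#include <vector>

// Prime area of home addresses, each with its own overflow list.
class ChainedTable {
private:
    struct Slot {
        bool used = false;
        unsigned key = 0;
        std::forward_list<unsigned> overflow;  // overflow area for synonyms
    };
    std::vector<Slot> prime_;

public:
    explicit ChainedTable(std::size_t prime_size) : prime_(prime_size) {}

    void insert(unsigned key) {
        Slot& slot = prime_[key % prime_.size()];
        if (!slot.used) {
            slot.used = true;                  // first key stays in the prime area
            slot.key = key;
        } else {
            slot.overflow.push_front(key);     // LIFO: newest at the beginning
        }
    }

    bool contains(unsigned key) const {
        const Slot& slot = prime_[key % prime_.size()];
        if (!slot.used) return false;
        if (slot.key == key) return true;
        for (unsigned k : slot.overflow) {
            if (k == key) return true;
        }
        return false;
    }

    const std::forward_list<unsigned>& overflow_at(std::size_t address) const {
        return prime_[address].overflow;
    }
};
```

Unlike open addressing, a collision here never occupies another key's home address, so resolving one collision does not increase the probability of future collisions.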
  25. Combination Approaches: A complex implementation often uses multiple steps. For example, one large database hashes to a bucket; if the bucket is full, it uses a set number of linear probes and then uses a linked list overflow area.