COMSM1402 Advanced Algorithms 2010
Raphaël Clifford
November 5, 2010

Due: The coursework should be handed in at the start of the lecture on Friday, 17 December. This is both the normal and late deadline. Online submissions can be made up to midnight. Your marks will be based on the best five answers out of the first six questions plus your mark for question seven.

Problems:

1. The purpose of this question is to show a weakly universal class of hash functions H for which E[M] = √(n − 1). Here M is the maximum load, assuming n items are hashed into n slots using a universal family of hash functions. For positive n, we use the notation [n] := {0, ..., n − 1}. Define a family H of hash functions from [n] to [n] as follows. Let ℓ be an integer, with 1 ≤ ℓ ≤ n. For each V ⊂ [n] of cardinality ℓ, we define the hash function h_V : [n] → [n] by the following property: h_V maps each element of V onto 0, and h_V maps [n] \ V injectively into [n] \ {0}. Note that h_V is not uniquely determined by this property, but we can always choose one h_V satisfying this property (verify). Define

H := {h_V : V ⊂ [n], |V| = ℓ}.

Argue that H is weakly universal if ℓ ≤ √(n − 1). Note that the maximum load always equals ℓ. [10 points]

2. The following approach is useful in streaming algorithms; you should think about why this might be. Suppose that we have a sequence of items, passing by one at a time. We want to maintain a sample of one item that has the property that it is uniformly distributed over all the items that we have seen at each step. Moreover, we want to accomplish this without knowing the total number of items in advance or storing all of the items that we see.

Consider the following algorithm, which stores just one item in memory at all times. When the first item appears, it is stored in the memory. When the kth item appears, it replaces the item in memory with probability 1/k. Explain why this algorithm solves the problem.
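
As an illustration only, the one-item procedure described above can be sketched in Python as follows. The names sample_one and empirical_frequencies are hypothetical and not part of the coursework; the second function is just an empirical check, assuming the stream is the integers 0, ..., n_items − 1.

```python
import random
from collections import Counter


def sample_one(stream, rng=random):
    """Keep a single item; the k-th item replaces it with probability 1/k."""
    kept = None
    for k, item in enumerate(stream, start=1):
        # rng.random() is uniform on [0, 1), so this test succeeds with
        # probability 1/k. For k = 1 it always succeeds, so the first
        # item is stored unconditionally.
        if rng.random() < 1.0 / k:
            kept = item
    return kept


def empirical_frequencies(n_items, trials, seed=0):
    """Run the sampler many times and report how often each item is kept."""
    rng = random.Random(seed)
    counts = Counter(sample_one(range(n_items), rng) for _ in range(trials))
    return {item: counts[item] / trials for item in range(n_items)}


if __name__ == "__main__":
    # Each frequency should be close to 1/5 if the sample is uniform.
    print(empirical_frequencies(5, 20000))
```

Running the check gives evidence that the sample is close to uniform, but it is not a substitute for the explanation the question asks for.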
Now suppose instead we want a sample of s items instead of just one, without replacement. That is, we don’t want to get the same item multiple
times in our sample. If this weren’t an issue, we could get a sample of s items with replacement just by running s independent copies of the above. Generalize the above process to that case. (Hint: start by taking the first s items and storing them as your sample. With what probability should each new item come into the sample?) [10 points]

3. The simplest variant of cuckoo hashing is as follows. There is a table with m cells. Each element x can hash into exactly two locations, given by hash functions h_1(x) and h_2(x). When an item is placed into the hash table, if at least one of these two locations is free, the item is placed in the free location. If neither location is free, x is placed in one of the two locations, and kicks out the element y that is in that location. Then y is placed in its alternative location. If that location is free, then all is well, and y is placed there. Otherwise, y must kick out the element in that location, and this new element must try to move to its alternative location, and so on. It is possible that, at some point, the process will loop. The loop can either be found explicitly, or a limit on the number of times elements can be kicked out can be enforced and the whole dataset rehashed if this limit is ever reached. One way to generalise this is to use more than two hash functions, so that each element has more than two alternative locations, with the element to kick out chosen randomly at each step.

The task is to implement a generalised variant of cuckoo hashing. You should make a choice about how you will create the hash functions and explain it clearly in terms of the randomness and independence you are using. You could, for example, simply toss some coins if you only need a small number of random bits to start off. Feel free to try different hash function families and report on what effect, if any, this has. You may also want to experiment with creating random numbers using methods described in the lectures or otherwise.
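
For orientation, the simplest two-location variant described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the generalised variant the task asks for: the class CuckooTable, the helper make_hash, and the hash family h(x) = ((a·x + b) mod P) mod m with P = 2^31 − 1 are illustrative choices that do not come from the coursework, and keys are assumed to be non-negative integers below P. The kick limit of 20 matches the convention suggested for the experiments.

```python
import random

# Assumption: keys are non-negative integers below the Mersenne prime 2^31 - 1.
P = 2_147_483_647


def make_hash(m, rng):
    """Draw h(x) = ((a*x + b) mod P) mod m with random a, b (illustrative choice)."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m


class CuckooTable:
    def __init__(self, m, rng, max_kicks=20):
        self.cells = [None] * m
        self.h1 = make_hash(m, rng)
        self.h2 = make_hash(m, rng)
        self.max_kicks = max_kicks

    def contains(self, x):
        return self.cells[self.h1(x)] == x or self.cells[self.h2(x)] == x

    def insert(self, x):
        """Return True on success, False once the kick limit is reached.

        On failure the element still in hand is not stored; the caller is
        expected to rehash the whole dataset.
        """
        a, b = self.h1(x), self.h2(x)
        for pos in (a, b):
            if self.cells[pos] is None:
                self.cells[pos] = x
                return True
        pos = a  # neither location is free: evict the occupant of the first one
        for _ in range(self.max_kicks):
            # Place x at pos; x now names the element that was kicked out.
            x, self.cells[pos] = self.cells[pos], x
            a, b = self.h1(x), self.h2(x)
            pos = b if pos == a else a  # the evicted element's alternative location
            if self.cells[pos] is None:
                self.cells[pos] = x
                return True
        return False


if __name__ == "__main__":
    table = CuckooTable(8192, random.Random(0))
    inserted = 0
    while table.insert(inserted):
        inserted += 1
    print("elements added before the first failure:", inserted)
```

The sketch always evicts from the first location when both are occupied; choosing the location to evict from at random, and using more than two hash functions, are the natural points at which a generalised implementation would differ.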
In your experiments, use a table of size 8192, and add elements until the first time you cannot add an element. (For convenience, you may assume an element cannot be added if, after repeating the kick out step 20 times, you are not done.) Using 2 hash functions and then 3 hash functions, and running the experiment 1000 times, examine how full the hash table can be before problems start to occur. Compare your results with the bounds from the theory and discuss what you find.

For this problem, please submit your code. You can choose any programming language you like, but please include clear instructions on how to run your code on a lab machine in a file called readme.txt that is included with your submission. [10 points]

4. This question has two parts. A naive implementation of a van Emde Boas tree uses O(|U|) space, where |U| is the universe size. Explain in detail
how this can be reduced to O(n) space (where n is the number of elements to be stored). What are the complexities of the different operations in your reduced space data structure?

The van Emde Boas tree layout can be used to implement a number of other data structures and to speed up important applications. Find an example from the literature and explain in detail how the van Emde Boas tree improves the time complexities of the relevant operations. Your explanation should give suitable citations and ideally provide proofs of any results you report. [10 points]

5. Consider the following pattern matching problem involving wildcard symbols. A single character wildcard is said to match any other symbol in the input alphabet.

INPUT: Text T = t_1 ... t_n, pattern P = p_1 ... p_m. At most ℓ of the pattern characters p_i are non-wildcards (i.e. normal characters) and the rest are single character wildcards.

OUTPUT: The Hamming distance between P and every substring of T of length m.

Example: let p = ab?ab, text t = b?bbabba and ℓ = 4. The output is 3, 0, 2, 4.

(a) Give an algorithm that solves this problem.

(b) What is the asymptotic time complexity of your algorithm?

Make sure to explain your working carefully. The better the time complexity, the more marks will be awarded. In particular, extra marks will be given for fast solutions whose running time is parameterised by ℓ as well as n and m. A Θ(nm) time solution will gain no marks. You can assume it takes no more than log_2 n bits (i.e. a single word of memory) to represent any of the input symbols and that simple arithmetic operations on the input symbols, including addition and multiplication, take constant time. [10 points]

6. (a) The recurrence for the running time of the algorithm for computing a suffix array presented in lectures is T(n) = T(2n/3) + O(n). Show how to modify the algorithm to give one whose recurrence is T(n) = T(3n/7) + O(n). Is 3/7 the best possible, or can you do better?
(b) Suppose we have a pattern p and a text t and we want to find for every position in t the longest substring of p that matches there exactly.
Give a fast algorithm to solve this problem together with its analysis. The better the time complexity, the more marks will be awarded. [10 points]

7. For this question you are asked to write a two page summary of a research paper. I would like you to choose a highly cited paper from one of the leading algorithms conferences to write about. Luckily there is already a website (http://www.cs.utah.edu/~suresh/citations/) that has been through the papers written from 1997–2006 for FOCS, STOC and SODA (look up what these stand for) and counted the citation numbers for you, although these numbers are now underestimates in most cases. Alternatively you may choose a paper from any of the conferences listed at http://www.cs.tau.ac.il/~iftgam/eventlist.htm. You should check on http://scholar.google.com that any paper you choose has a current citation count of at least one hundred. Please post the title of the paper, its authors, the conference name and the number of citations on the unit forum as soon as you have made your choice. You may not, of course, choose the same paper as someone else.

Your two page review should include:

• A short one or two paragraph summary of the paper.
• A deeper, more extensive outline of the main points of the paper, including for example assumptions made, arguments presented, data analyzed, and conclusions drawn.
• Any limitations or extensions you see for the ideas in the paper.
• Your opinion of the paper; primarily, the quality of the ideas and its real or potential impact.

[30 points]

Academic Integrity: All the work you hand in should be your own. If you work with other students, you should list them on your coursework along with a brief explanation of which topics you discussed. In general, any source other than the lectures should be explicitly cited at the point where it is used.