# Tutorial 3 (b tree min heap)

Part of the Search Engine course given in the Technion (2011)



1. B-Tree Lexicon, Min-Heaps. Kira Radinsky. Min-Heap slides are courtesy of Aya Soffer and David Carmel, IBM Haifa Research Lab.
2. The Lexicon as a B-Tree (2 November 2010, 236621 Search Engine Technology)
   - B-Tree: a balanced tree that is optimized for disk I/O, holding key/value pairs
   - Branching is defined by a min-degree parameter t, t > 1
     - t is chosen according to the size of a disk block
   - Any internal node other than the root has at least t and at most 2t children; the root has either no children, or at least two and at most 2t children
   - Any internal node with k children also stores k-1 keys, which serve as separator values: separator j is larger than the keys of subtree j and smaller than the keys of subtree j+1
   - Leaf nodes, like all nodes, store at most 2t-1 key/value pairs
     - When not the root, they store at least t-1 key/value pairs
   - Lookup, insertion and deletion operations on a B-Tree are linear in its height (and so logarithmic, base t, in the number of keys)
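The degree bounds above can be checked mechanically. The following is a minimal sketch, not the course's implementation: the `Node` class and the function name `satisfies_degree_bounds` are hypothetical, values are omitted, and only the key and child counts are verified (not key ordering or uniform leaf depth).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical B-Tree node; values are left out to keep the sketch short."""
    keys: List[str]
    children: List["Node"] = field(default_factory=list)  # empty for a leaf


def satisfies_degree_bounds(node: Node, t: int, is_root: bool = True) -> bool:
    """Check the key/child count rules for min-degree t on a whole subtree."""
    k = len(node.keys)
    if k > 2 * t - 1:                 # no node stores more than 2t-1 keys
        return False
    if not is_root and k < t - 1:     # non-root nodes store at least t-1 keys
        return False
    if node.children:
        if len(node.children) != k + 1:   # k+1 children are separated by k keys
            return False
        if is_root and k < 1:             # a root with children has at least two
            return False
        return all(satisfies_degree_bounds(c, t, False) for c in node.children)
    return True
```

With t=2 this accepts a root holding two keys above three leaves of two or three keys each, and rejects any node holding four keys.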
3. B-Tree Lexicon - Example
   - t=2
   - Each key is associated with a value that contains a DF and a pointer to the postings list (dashed line in the figure)
   - [Figure: B-Tree with nodes "gets more" (1 2), "and as bad" (3 1 2), "good is it" (2 1 2), "the ugly" (1 2); the numbers appear next to the keys in the figure]
4. B-Tree Lookup
   Looking up the value associated with key x:
   1. current_node ← root
   2. Let k1<k2<…<km be the keys of current_node
   3. If x ∈ {k1,k2,…,km}, we're done: return the associated value
   4. Else, if current_node is a leaf node, return null
   5. Else, let j be the smallest index s.t. x<kj (j ← m+1 if x>km); current_node ← j'th subtree, and go to 2
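The lookup steps can be sketched in Python as follows. This is an illustration under assumptions rather than the course's code: the `Node` class and `btree_lookup` name are hypothetical, nodes live in memory instead of on disk blocks, and the binary search inside a node (`bisect_left`) stands in for "find the smallest j with x<kj". Indices are 0-based here, whereas the slide counts from 1.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Node:
    """Hypothetical in-memory B-Tree node."""
    keys: List[str]                       # sorted keys k1 < k2 < ... < km
    values: List[Any]                     # values[i] belongs to keys[i]
    children: List["Node"] = field(default_factory=list)  # empty for a leaf


def btree_lookup(root: Node, x: str) -> Optional[Any]:
    """Return the value stored under key x, or None if x is absent."""
    node = root
    while True:
        j = bisect_left(node.keys, x)     # index of the first key >= x
        if j < len(node.keys) and node.keys[j] == x:
            return node.values[j]         # step 3: key found in this node
        if not node.children:
            return None                   # step 4: reached a leaf without a match
        node = node.children[j]           # step 5: descend into the j'th subtree


# Tree shape assumed from the example slide; the values are made-up placeholders.
leaves = [
    Node(["and", "as", "bad"], [30, 10, 20]),
    Node(["good", "is", "it"], [21, 11, 22]),
    Node(["the", "ugly"], [12, 23]),
]
root = Node(["gets", "more"], [13, 24], leaves)
```

Each loop iteration touches one node, so the number of node reads is bounded by the tree's height.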
5. Top-r Document Selection
   Problem definition: given a set A of scored documents, select the r documents with the highest scores in A and return them in decreasing relevance order
   - Naïve method: sort the set A by score
     - If |A|=M, time complexity is O(M log M)
   - Better approach: since typically r<<M, selecting the r top scores can be done in O(M + r log M) time using a heap:
     1. Heapify the set of M scores (about 2M comparisons) so that the top score is at the root
     2. Repeatedly extract the heap's root (r times), each time fixing the heap in O(log M)
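A minimal sketch of the heap-based approach, using Python's standard `heapq` module. Since `heapq` only provides a min-heap, the scores are negated so that the largest score sits at the root; that negation, and the function name `top_r_by_full_heap`, are choices made for this example and are not part of the slides.

```python
import heapq
from typing import List


def top_r_by_full_heap(scores: List[float], r: int) -> List[float]:
    """Return the r highest scores in decreasing order via heapify + r extractions."""
    heap = [-s for s in scores]       # negate: heapq keeps the smallest item at the root
    heapq.heapify(heap)               # linear-time heap construction over all M scores
    count = min(r, len(heap))
    # Each heappop removes the root and restores heap order in O(log M).
    return [-heapq.heappop(heap) for _ in range(count)]
```

Scores come out already sorted from highest to lowest, so no final sort is needed.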
6. The Heap Data Structure - Reminder
   - A binary heap is a (mostly full) binary tree with values stored at all leaves and internal nodes, and an ordering rule that requires values to be non-decreasing (alternatively, non-increasing) along each path from a leaf to the root
     - Largest/smallest value is at the root
   - Heap implemented in an array:
     - Root at index 1
     - For the node at index i, the left child is at index 2i and the right child at index 2i+1
     - Thus the parent of the node at index i is at index ⌊i/2⌋
7. Binary Heap Stored in an Array
   - [Figure: a max-heap drawn as a binary tree, with its array layout underneath]
   - Array contents at indices 1 to 10: 23, 17, 15, 17, 8, 2, 13, 4, 14, 5
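The index arithmetic can be tried out on the array from this slide. The sketch below assumes the array values as read from the slide's figure, and pads index 0 with `None` so that the root sits at index 1 as in the slides; the helper names are hypothetical.

```python
def parent(i: int) -> int:
    return i // 2          # integer division gives floor(i/2)


def left(i: int) -> int:
    return 2 * i


def right(i: int) -> int:
    return 2 * i + 1


def is_max_heap(a: list) -> bool:
    """a[0] is unused padding; check that every node is <= its parent."""
    return all(a[parent(i)] >= a[i] for i in range(2, len(a)))


# Values at indices 1..10, as read from the slide's figure.
heap = [None, 23, 17, 15, 17, 8, 2, 13, 4, 14, 5]
```

For example, the node at index 2 (value 17) has its children at indices 4 and 5 (values 17 and 8), and the last element at index 10 has its parent at index 5.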
8. Extracting the Top Element
   - Remove the largest item r times
   - Each time:
     - Remove the largest item, the root of the heap
     - Replace it with the last element of the heap
     - Sift the new root down until order is restored
   - Example
     - Remove item 23 from the root
     - The last item in the array, 5 (at location 10), replaces it
     - Reinstate heap order: in the worst case 5 is sifted all the way back down the tree; the number of sifts is bounded by log(size of heap)
9. Heap Example (cont.)
   To restore order at the top level of the tree, item 17, the larger of the root's 2 children, must be swapped with 5. This limits the order violation to the left sub-tree. The process is repeated until heap order is restored.
   [Figure: the tree with 5 at the root, above its children 17 and 15]
10. Heap Example (cont.)
    [Figure: successive snapshots of the tree as 5 is sifted down from the root until heap order is restored]
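The extraction walked through in this example can be sketched as below. This is an illustrative implementation under the slides' 1-based array convention (index 0 is unused padding); the function name `extract_max` is hypothetical, and the starting array is the one read from the earlier array slide.

```python
def extract_max(heap: list):
    """Remove and return the root of a 1-based max-heap (heap[0] is padding)."""
    top = heap[1]
    last = heap.pop()                 # detach the last element of the array
    n = len(heap) - 1                 # number of elements still in the heap
    if n >= 1:
        heap[1] = last                # the last element replaces the root
        i = 1
        while 2 * i <= n:             # while node i has at least one child
            child = 2 * i
            if child + 1 <= n and heap[child + 1] > heap[child]:
                child += 1            # pick the larger of the two children
            if heap[i] >= heap[child]:
                break                 # heap order restored
            heap[i], heap[child] = heap[child], heap[i]
            i = child                 # keep sifting down in that subtree
    return top


heap = [None, 23, 17, 15, 17, 8, 2, 13, 4, 14, 5]
removed = extract_max(heap)
```

On this input, 5 is swapped with 17 at index 2, then with 17 at index 4, then with 14 at index 9, where it ends up as a leaf.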
11. Top-r Selection Using a Min-Heap
    - The selection problem can be solved by a heap that stores the smallest item at the root: a min-heap
    - A min-heap of r items is held instead of a max-heap of M; lots of memory is saved, which is always good
    - Process the M scores, storing in the min-heap the r largest values seen so far
      - The first r values are heapified in O(r) comparisons
      - Replace the smallest value in the min-heap (the r-th largest) whenever a larger value is found
    - Sort the r highest values in descending order and return the corresponding documents
      - O(r log r)
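A short sketch of this procedure with Python's `heapq`, which is a min-heap out of the box. The `(score, doc_id)` tuple layout and the function name `top_r` are assumptions made for the example; ties between equal scores are broken by `doc_id` only because Python compares tuples element by element.

```python
import heapq
from itertools import islice
from typing import Iterable, List, Tuple


def top_r(scored_docs: Iterable[Tuple[float, str]], r: int) -> List[Tuple[float, str]]:
    """Return the r highest-scoring (score, doc_id) pairs, best first."""
    if r <= 0:
        return []
    it = iter(scored_docs)
    heap = list(islice(it, r))        # the first r values
    heapq.heapify(heap)               # O(r) construction; smallest at heap[0]
    for item in it:
        if item[0] > heap[0][0]:      # larger than the r-th largest seen so far
            heapq.heapreplace(heap, item)   # drop the smallest, insert item
    return sorted(heap, reverse=True) # final O(r log r) sort, descending


docs = [(0.2, "d1"), (0.9, "d2"), (0.5, "d3"), (0.7, "d4"), (0.1, "d5")]
best = top_r(docs, 2)
```

Only r items are ever held in memory, and the input can be any iterable, so scores can be streamed rather than collected first.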
12. Min-Heap Processing - Illustration
    [Figure: the score sequence split into a processed and an unprocessed part; a min-heap holds the r largest items seen so far, and the smallest value is discarded]
13. Top-r Selection Using a Min-Heap: Complexity Analysis
    - Worst case: the scores are already in increasing order
      - Each of the last M-r values is inserted into the heap
      - Furthermore, it percolates to the bottom of the heap
      - Complexity is O((M-r) log r)
    - Average case: the scores arrive in a permutation of size M chosen uniformly at random
      - The expected number of the last M-r values that are inserted into the heap is ~ r ln(M/r)
      - Each insertion costs O(log r)
      - Insertion cost is O(r log r log(M/r)), on top of the M-r comparisons against the heap's root
    - Proof on the board
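The average-case insertion count can be derived with a standard argument, sketched here; it may differ from the proof given on the board. It assumes distinct scores arriving in uniformly random order. Let X_i indicate that the i-th score (i > r) enters the min-heap, which happens exactly when it is among the r largest of the first i scores. H_n denotes the n-th harmonic number.

```latex
\Pr[X_i = 1] = \frac{r}{i},
\qquad
\mathbb{E}\Bigl[\sum_{i=r+1}^{M} X_i\Bigr]
  = \sum_{i=r+1}^{M} \frac{r}{i}
  = r\,(H_M - H_r)
  \approx r \ln\frac{M}{r}
```

Multiplying by the O(log r) cost of one insertion gives the O(r log r log(M/r)) bound.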