Data Integration, Concluded and Physical Storage Zachary G. Ives University of Pennsylvania CIS 550 – Database & Informati...
An Alternate Approach: The Information Manifold (Levy et al.) <ul><li>When you integrate something, you have some conceptu...
The Local-as-View Model <ul><li>The basic model is the following: </li></ul><ul><ul><li>“Local” sources are views over the...
Answering Queries Using Views  <ul><li>Assumption:  conjunctive queries ,  set semantics </li></ul><ul><ul><li>Suppose we ...
Query Answering <ul><li>Suppose we have the query: </li></ul><ul><li>  q(a, t, p) :- author(a, i, _), book(i, t, p) </li><...
Inverse Rules <ul><li>We can take every mapping and “invert” it, though sometimes we may have insufficient information: </...
But NULLs Lose Information <ul><li>Suppose we take these rules and ask for:  </li></ul><ul><ul><ul><ul><li>q(a,t) :- autho...
The Solution: “Skolem Functions” <ul><li>Skolem  functions: </li></ul><ul><ul><li>Conceptual “perfect” hash functions </li...
Revisiting Our Example <ul><li>Query:  </li></ul><ul><ul><ul><ul><li>q(a,t) :- author(a, i, _), book(i, t, p) </li></ul></...
Query Answering Using  Inverse Rules <ul><li>Invert all rules using the procedures described </li></ul><ul><li>Take the qu...
Summary of Data Integration <ul><li>Local-as-view integration has replaced global-as-view as the standard </li></ul><ul><u...
Performance:  What Governs It? <ul><li>Speed of the machine – of course! </li></ul><ul><li>But also many software-controll...
Our General Emphasis <ul><li>Goal: cover basic  principles  that are applied throughout database system design </li></ul><...
What’s the “Base” in “Database”? <ul><li>Could just be a file with random access </li></ul><ul><ul><li>What are the advant...
Buffer Management <ul><li>Could keep DB in RAM </li></ul><ul><ul><li>“ Main-memory DBs” like TimesTen </li></ul></ul><ul><...
Storing Tuples in Pages <ul><li>Tuples </li></ul><ul><ul><li>Many possible layouts </li></ul></ul><ul><ul><ul><li>Dynamic ...
Alternatives for Organizing Files <ul><li>Many alternatives,  each ideal for some situation, and poor for others : </li></...
Model for Analyzing Access Costs <ul><li>We ignore CPU costs, for simplicity: </li></ul><ul><ul><li>p(T):   The number of ...
Assumptions in Our Analysis <ul><li>Single record insert and delete </li></ul><ul><li>Heap files: </li></ul><ul><ul><li>Eq...
Cost of Operations  Delete Insert Range Search Equality Search Scan all recs Hashed File Sorted File Heap File
Cost of Operations <ul><li>Several assumptions underlie these (rough) estimates! </li></ul>2D Search + p(T) D Search + D D...
Speeding Operations over Data <ul><li>Three general data organization techniques: </li></ul><ul><ul><li>Indexing </li></ul...
Technique I: Indexing <ul><li>An  index  on a file speeds up selections on the  search   key   attributes   for the index ...
Alternatives for Data Entry  k*  in Index <ul><li>Three alternatives: </li></ul><ul><ul><li>Data record with key value  k ...
Classes of Indices <ul><li>Primary   vs.   secondary :  primary has primary key </li></ul><ul><li>Clustered   vs.   unclus...
Clustered vs. Unclustered Index <ul><li>Suppose Index Alternative (2) used, records are stored in Heap file </li></ul><ul>...
B+ Tree:  The DB World’s  Favorite Index <ul><li>Insert/delete at log  F  N cost  </li></ul><ul><ul><li>(F = fanout, N = #...
Example B+ Tree <ul><li>Search begins at root, and key comparisons direct it to a leaf. </li></ul><ul><li>Search for 5*, 1...
B+ Trees in Practice <ul><li>Typical order: 100.  Typical fill-factor: 67%. </li></ul><ul><ul><li>average fanout = 133 </l...
Inserting Data into a B+ Tree <ul><li>Find correct leaf L.  </li></ul><ul><li>Put data entry onto L. </li></ul><ul><ul><li...
Inserting 8* into Example B+ Tree <ul><li>Observe how minimum occupancy is guaranteed in both leaf and index pg splits. </...
Inserting 8* Example: Copy up Root 17 24 30 2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39* 13 Want to insert ...
Inserting 8* Example: Push up Root 17 24 30 2* 3* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39* 13 5* 7* 8* 5 Need to sp...
Deleting Data from a B+ Tree <ul><li>Start at root, find leaf L where entry belongs. </li></ul><ul><li>Remove the entry. <...
B+ Tree Summary <ul><li>B+ tree and other indices ideal for range searches, good for equality searches. </li></ul><ul><ul>...
Other Kinds of Indices <ul><li>Multidimensional indices </li></ul><ul><ul><li>R-trees, kD-trees, … </li></ul></ul><ul><li>...
Speeding Operations over Data <ul><li>Three general data organization techniques: </li></ul><ul><ul><li>Indexing </li></ul...
Technique II:  Sorting <ul><li>Pass 1: Read a page, sort it, write it. </li></ul><ul><ul><li>only one buffer page is used ...
Two-Way External Merge Sort <ul><li>Each pass we read, write each page in file. </li></ul><ul><li>N pages in the file => t...
General External Merge Sort <ul><li>To sort a file with  N  pages using  B  buffer pages: </li></ul><ul><ul><li>Pass 0: us...
Cost of External Merge Sort <ul><li>Number of passes: </li></ul><ul><li>Cost = 2N * (# of passes) </li></ul><ul><li>With 5...
Speeding Operations over Data <ul><li>Three general data organization techniques: </li></ul><ul><ul><li>Indexing </li></ul...
Technique 3:  Hashing <ul><li>A familiar idea: </li></ul><ul><ul><li>Requires “good” hash function (may depend on data) </...
Upcoming SlideShare
Loading in …5
×

11/16 Slides

675 views

Published on

Published in: Education, Spiritual
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
675
On SlideShare
0
From Embeds
0
Number of Embeds
4
Actions
Shares
0
Downloads
9
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide
  • 2
  • 3
  • 4
  • 5
  • 7
  • 8
  • 11
  • 12
  • 9
  • 10
  • 6
  • 12
  • 14
  • 23
  • 5
  • 6
  • 7
  • 8
  • 11/16 Slides

    1. 1. Data Integration, Concluded and Physical Storage Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November 16, 2004
    2. 2. An Alternate Approach: The Information Manifold (Levy et al.) <ul><li>When you integrate something, you have some conceptual model of the integrated domain </li></ul><ul><ul><li>Define that as a basic frame of reference, everything else as a view over it </li></ul></ul><ul><ul><li>“ Local as View” </li></ul></ul><ul><li>May have overlapping/incomplete sources </li></ul><ul><ul><li>Define each source as the subset of a query over the mediated schema </li></ul></ul><ul><ul><li>We can use selection or join predicates to specify that a source contains a range of values : </li></ul></ul><ul><ul><ul><li>ComputerBooks(…)  Books(Title, …, Subj), Subj = “Computers” </li></ul></ul></ul>
    3. 3. The Local-as-View Model <ul><li>The basic model is the following: </li></ul><ul><ul><li>“Local” sources are views over the mediated schema </li></ul></ul><ul><ul><li>Sources have the data – mediated schema is virtual </li></ul></ul><ul><ul><li>Sources may not have all the data from the domain – “open-world assumption” </li></ul></ul><ul><li>The system must use the sources (views) to answer queries over the mediated schema </li></ul>
    4. 4. Answering Queries Using Views <ul><li>Assumption: conjunctive queries , set semantics </li></ul><ul><ul><li>Suppose we have a mediated schema: author(aID, isbn, year), book(isbn, title, publisher) </li></ul></ul><ul><ul><li>A conjunctive query might be: q(a, t, p) :- author(a, i, _), book(i, t, p), t = “DB2 UDB” </li></ul></ul><ul><li>Recall intuitions about this class of queries: </li></ul><ul><ul><li>Adding a conjunct to a query removes answers from the result but never adds any </li></ul></ul><ul><ul><li>Any conjunctive query with at least the same constraints & conjuncts will give valid answers </li></ul></ul>
    5. 5. Query Answering <ul><li>Suppose we have the query: </li></ul><ul><li> q(a, t, p) :- author(a, i, _), book(i, t, p) </li></ul><ul><li>and sources: </li></ul><ul><ul><ul><ul><li>s1(a,t)  author(a, i, _), book(i, t, p), t = “123” </li></ul></ul></ul></ul><ul><ul><ul><ul><li>… </li></ul></ul></ul></ul><ul><ul><ul><ul><li>s5(a,i)  author(a, i, _) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>s6(i,p)  book(i, t, p) </li></ul></ul></ul></ul><ul><li>We want to compose the query with the source mappings – but they’re in the wrong direction! </li></ul>
    6. 6. Inverse Rules <ul><li>We can take every mapping and “invert” it, though sometimes we may have insufficient information: </li></ul><ul><ul><li>If </li></ul></ul><ul><ul><ul><ul><li>s5(a,i)  author(a, i, _) </li></ul></ul></ul></ul><ul><ul><li>then we can also infer that: </li></ul></ul><ul><ul><ul><ul><li>author(a, i, ??? )  s5(a,i) </li></ul></ul></ul></ul><ul><li>But how to handle the absence of the 3 rd (publisher) attribute? </li></ul><ul><ul><li>We know that there must be AT LEAST one instance of ??? in author for each (a,i) pair </li></ul></ul><ul><ul><li>So we might simply insert a NULL and define that NULL means “unknown” (as opposed to “missing”)… </li></ul></ul>
    7. 7. But NULLs Lose Information <ul><li>Suppose we take these rules and ask for: </li></ul><ul><ul><ul><ul><li>q(a,t) :- author(a, i, _), book(i, t, p) </li></ul></ul></ul></ul><ul><li>If we look at the rule: </li></ul><ul><ul><ul><ul><li>s1(a,t)  author(a, i, _), book(i, t, p), t = “123” </li></ul></ul></ul></ul><ul><li>Clearly q(a,t)  s1(a,t) </li></ul><ul><li>But if apply our inversion procedure, we get: </li></ul><ul><ul><ul><ul><li>author(a, NULL , NULL)  s1(a,t) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>book( NULL , t, p)  s1(a,t), t = “123” </li></ul></ul></ul></ul><ul><li>and there’s no way to kow to join author and book on NULL! </li></ul><ul><ul><li>We need “a special NULL for each a-t combo” so we can figure out which a’s and t’s go together </li></ul></ul>
    8. 8. The Solution: “Skolem Functions” <ul><li>Skolem functions: </li></ul><ul><ul><li>Conceptual “perfect” hash functions </li></ul></ul><ul><ul><li>Each function returns a unique, deterministic value for each combination of input values </li></ul></ul><ul><ul><li>Every function returns a non-overlapping set of values (Skolem function F will never return a value that matches any of Skolem function G’s values) </li></ul></ul><ul><li>Skolem functions won’t ever be part of the answer set or the computation – it doesn’t produce real values </li></ul><ul><ul><li>They’re just a way of logically generating “special NULLs” </li></ul></ul>
    9. 9. Revisiting Our Example <ul><li>Query: </li></ul><ul><ul><ul><ul><li>q(a,t) :- author(a, i, _), book(i, t, p) </li></ul></ul></ul></ul><ul><li>Mapping rule: </li></ul><ul><ul><ul><ul><li>s1(a,t)  author(a, i, _), book(i, t, p), t = “123” </li></ul></ul></ul></ul><ul><li>Inverse rules: </li></ul><ul><ul><ul><ul><li>author(a, f(a,t) , NULL)  s1(a,t) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>book( f(a,t) , t, p)  s1(a,t), t = “123” </li></ul></ul></ul></ul><ul><li>Expand the query as follows: </li></ul><ul><ul><li>q(a,t) :- author(a, i, NULL), book(i, t, p), i = f(a,t) </li></ul></ul><ul><ul><li>q(a,t) :- s1(a,t), s1(a,t), t = “123”, i = f(a,t) </li></ul></ul>
    10. 10. Query Answering Using Inverse Rules <ul><li>Invert all rules using the procedures described </li></ul><ul><li>Take the query and the possible rule expansions and execute them in a Datalog interpreter </li></ul><ul><ul><li>In the previous query, we expand with all combinations of expansions of book and of author – every possible way of combining and cross-correlating info from different sources </li></ul></ul><ul><ul><li>Then we throw away all unsatisfiable rewritings (some expansions will be logically inconsistent) </li></ul></ul><ul><li>More efficient, but equivalent, algorithms now exist: </li></ul><ul><ul><li>Bucket algorithm [Levy et al.] </li></ul></ul><ul><ul><li>MiniCon [Pottinger & Halevy] </li></ul></ul><ul><ul><li>Also related: “chase and backchase” [Popa, Tannen, Deutsch] </li></ul></ul>
    11. 11. Summary of Data Integration <ul><li>Local-as-view integration has replaced global-as-view as the standard </li></ul><ul><ul><li>More robust way of defining mediated schemas and sources </li></ul></ul><ul><ul><li>Mediated schema is clearly defined, less likely to change </li></ul></ul><ul><ul><li>Sources can be more accurately described </li></ul></ul><ul><li>Methods exist for query reformulation, including inverse rules </li></ul><ul><li>Integration requires standardization on a single schema </li></ul><ul><ul><li>Can be hard to get consensus </li></ul></ul><ul><ul><li>Today we have peer-to-peer data integration, e.g., Piazza [Halevy et al.], Orchestra [Ives et al.], Hyperion [Miller et al.] </li></ul></ul><ul><li>Some other aspects of integration were addressed in related papers </li></ul><ul><ul><li>Overlap between sources; coverage of data at sources </li></ul></ul><ul><ul><li>Semi-automated creation of mappings and wrappers </li></ul></ul><ul><li>Data integration capabilities in commercial products: BEA’s Liquid Data, IBM’s DB2 Information Integrator, numerous packages from middleware companies </li></ul>
    12. 12. Performance: What Governs It? <ul><li>Speed of the machine – of course! </li></ul><ul><li>But also many software-controlled factors that we must understand: </li></ul><ul><ul><li>Caching and buffer management </li></ul></ul><ul><ul><li>How the data is stored – physical layout, partitioning </li></ul></ul><ul><ul><li>Auxiliary structures – indices </li></ul></ul><ul><ul><li>Locking and concurrency control (we’ll talk about this later) </li></ul></ul><ul><ul><li>Different algorithms for operations – query execution </li></ul></ul><ul><ul><li>Different orderings for execution – query optimization </li></ul></ul><ul><ul><li>Reuse of materialized views, merging of query subexpressions – answering queries using views; multi-query optimization </li></ul></ul>
    13. 13. Our General Emphasis <ul><li>Goal: cover basic principles that are applied throughout database system design </li></ul><ul><li>Use the appropriate strategy in the appropriate place </li></ul><ul><ul><li>Every (reasonable) algorithm is good somewhere </li></ul></ul><ul><li>… And a corollary: database people reinvent a lot of things and add minor tweaks… </li></ul>
    14. 14. What’s the “Base” in “Database”? <ul><li>Could just be a file with random access </li></ul><ul><ul><li>What are the advantages and disadvantages? </li></ul></ul><ul><li>DBs generally require “raw” disk access </li></ul><ul><ul><li>Need to know when a page is actually written to disk, vs. queued by the OS </li></ul></ul><ul><ul><li>Predictable performance, less fragmentation </li></ul></ul><ul><ul><li>May want to exploit striping or contiguous regions </li></ul></ul><ul><ul><li>Typically divided into “extents” and pages </li></ul></ul>
    15. 15. Buffer Management <ul><li>Could keep DB in RAM </li></ul><ul><ul><li>“ Main-memory DBs” like TimesTen </li></ul></ul><ul><li>But many DBs are still too big; we read & replace pages </li></ul><ul><ul><li>May need to force to disk or pin in buffer </li></ul></ul><ul><li>Policies for page replacement , prefetching </li></ul><ul><ul><li>LRU, as in Operating Systems (not as good as you might think – why not? ) </li></ul></ul><ul><ul><li>MRU (one-time sequential scans) </li></ul></ul><ul><ul><li>Clock, etc. </li></ul></ul><ul><ul><li>DBMIN (min # pages, local policy) </li></ul></ul>Buffer Mgr Tuple Reads/Writes
    16. 16. Storing Tuples in Pages <ul><li>Tuples </li></ul><ul><ul><li>Many possible layouts </li></ul></ul><ul><ul><ul><li>Dynamic vs. fixed lengths </li></ul></ul></ul><ul><ul><ul><li>Ptrs, lengths vs. slots </li></ul></ul></ul><ul><ul><li>Tuples grow down, directories grow up </li></ul></ul><ul><ul><li>Identity and relocation </li></ul></ul><ul><li>Objects and XML are harder </li></ul><ul><ul><li>Horizontal, path, vertical partitioning </li></ul></ul><ul><ul><li>Generally no algorithmic way of deciding </li></ul></ul><ul><li>Generally want to leave some space for insertions </li></ul>t1 t2 t3
    17. 17. Alternatives for Organizing Files <ul><li>Many alternatives, each ideal for some situation, and poor for others : </li></ul><ul><ul><li>Heap files: for full file scans or frequent updates </li></ul></ul><ul><ul><ul><li>Data unordered </li></ul></ul></ul><ul><ul><ul><li>Write new data at end </li></ul></ul></ul><ul><ul><li>Sorted Files: if retrieved in sort order or want range </li></ul></ul><ul><ul><ul><li>Need external sort or an index to keep sorted </li></ul></ul></ul><ul><ul><li>Hashed Files: if selection on equality </li></ul></ul><ul><ul><ul><li>Collection of buckets with primary & overflow pages </li></ul></ul></ul><ul><ul><ul><li>Hashing function over search key attributes </li></ul></ul></ul>
    18. 18. Model for Analyzing Access Costs <ul><li>We ignore CPU costs, for simplicity: </li></ul><ul><ul><li>p(T): The number of data pages in table T </li></ul></ul><ul><ul><li>r(T): Number of records in table T </li></ul></ul><ul><ul><li>D: (Average) time to read or write disk page </li></ul></ul><ul><ul><li>Measuring number of page I/O’s ignores gains of pre-fetching blocks of pages; thus, I/O cost is only approximated. </li></ul></ul><ul><ul><li>Average-case analysis; based on several simplistic assumptions. </li></ul></ul><ul><ul><li>Good enough to show the overall trends! </li></ul></ul>
    19. 19. Assumptions in Our Analysis <ul><li>Single record insert and delete </li></ul><ul><li>Heap files: </li></ul><ul><ul><li>Equality selection on key; exactly one match </li></ul></ul><ul><ul><li>Insert always at end of file </li></ul></ul><ul><li>Sorted files: </li></ul><ul><ul><li>Files compacted after deletions </li></ul></ul><ul><ul><li>Selections on sort field(s) </li></ul></ul><ul><li>Hashed files: </li></ul><ul><ul><li>No overflow buckets, 80% page occupancy </li></ul></ul>
    20. 20. Cost of Operations Delete Insert Range Search Equality Search Scan all recs Hashed File Sorted File Heap File
    21. 21. Cost of Operations <ul><li>Several assumptions underlie these (rough) estimates! </li></ul>2D Search + p(T) D Search + D Delete 2D Search + p(T) D 2D Insert 1.25 p(T) D D log 2 p(T) + (# pages with matches) p(T) D Range Search D D log 2 p(T) p(T) D / 2 Equality Search 1.25 p(T) D p(T) D p(T) D Scan all recs Hashed File Sorted File Heap File
    22. 22. Speeding Operations over Data <ul><li>Three general data organization techniques: </li></ul><ul><ul><li>Indexing </li></ul></ul><ul><ul><li>Sorting </li></ul></ul><ul><ul><li>Hashing </li></ul></ul>
    23. 23. Technique I: Indexing <ul><li>An index on a file speeds up selections on the search key attributes for the index (trade space for speed). </li></ul><ul><ul><li>Any subset of the fields of a relation can be the search key for an index on the relation. </li></ul></ul><ul><ul><li>Search key is not the same as key (minimal set of fields that uniquely identify a record in a relation). </li></ul></ul><ul><li>An index contains a collection of data entries , and supports efficient retrieval of all data entries k* with a given key value k . </li></ul>
    24. 24. Alternatives for Data Entry k* in Index <ul><li>Three alternatives: </li></ul><ul><ul><li>Data record with key value k </li></ul></ul><ul><ul><ul><li>Clustered  fast lookup </li></ul></ul></ul><ul><ul><ul><li>Index is large; only 1 can exist </li></ul></ul></ul><ul><ul><li>< k , rid of data record with search key value k > , OR </li></ul></ul><ul><ul><li>< k , list of rids of data records with search key k > </li></ul></ul><ul><ul><ul><li>Can have secondary indices </li></ul></ul></ul><ul><ul><ul><li>Smaller index may mean faster lookup </li></ul></ul></ul><ul><ul><ul><li>Often not clustered  more expensive to use </li></ul></ul></ul><ul><li>Choice of alternative for data entries is orthogonal to the indexing technique used to locate data entries with a given key value k . </li></ul>
    25. 25. Classes of Indices <ul><li>Primary vs. secondary : primary has primary key </li></ul><ul><li>Clustered vs. unclustered : order of records and index approximately same </li></ul><ul><ul><li>Alternative 1 implies clustered, but not vice-versa </li></ul></ul><ul><ul><li>A file can be clustered on at most one search key </li></ul></ul><ul><li>Dense vs. Sparse : dense has index entry per data value; sparse may “skip” some </li></ul><ul><ul><li>Alternative 1 always leads to dense index </li></ul></ul><ul><ul><li>Every sparse index is clustered! </li></ul></ul><ul><ul><li>Sparse indexes are smaller; however, some useful optimizations are based on dense indexes </li></ul></ul>
    26. 26. Clustered vs. Unclustered Index <ul><li>Suppose Index Alternative (2) used, records are stored in Heap file </li></ul><ul><ul><li>Perhaps initially sort data file, leave some gaps </li></ul></ul><ul><ul><li>Inserts may require overflow pages </li></ul></ul>Index entries Data entries direct search for (Index File) (Data file) Data Records data entries Data entries Data Records CLUSTERED UNCLUSTERED
    27. 27. B+ Tree: The DB World’s Favorite Index <ul><li>Insert/delete at log F N cost </li></ul><ul><ul><li>(F = fanout, N = # leaf pages) </li></ul></ul><ul><ul><li>Keep tree height-balanced </li></ul></ul><ul><li>Minimum 50% occupancy (except for root). </li></ul><ul><li>Each node contains d <= m <= 2 d entries. d is called the order of the tree. </li></ul><ul><li>Supports equality and range searches efficiently. </li></ul>Index Entries Data Entries (&quot;Sequence set&quot;) (Direct search)
    28. 28. Example B+ Tree <ul><li>Search begins at root, and key comparisons direct it to a leaf. </li></ul><ul><li>Search for 5*, 15*, all data entries >= 24* ... </li></ul><ul><li>Based on the search for 15*, we know it is not in the tree! </li></ul>Root 17 24 30 2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39* 13
    29. 29. B+ Trees in Practice <ul><li>Typical order: 100. Typical fill-factor: 67%. </li></ul><ul><ul><li>average fanout = 133 </li></ul></ul><ul><li>Typical capacities: </li></ul><ul><ul><li>Height 4: 1334 = 312,900,700 records </li></ul></ul><ul><ul><li>Height 3: 1333 = 2,352,637 records </li></ul></ul><ul><li>Can often hold top levels in buffer pool: </li></ul><ul><ul><li>Level 1 = 1 page = 8 Kbytes </li></ul></ul><ul><ul><li>Level 2 = 133 pages = 1 Mbyte </li></ul></ul><ul><ul><li>Level 3 = 17,689 pages = 133 MBytes </li></ul></ul>
    30. 30. Inserting Data into a B+ Tree <ul><li>Find correct leaf L. </li></ul><ul><li>Put data entry onto L. </li></ul><ul><ul><li>If L has enough space, done! </li></ul></ul><ul><ul><li>Else, must split L (into L and a new node L2) </li></ul></ul><ul><ul><ul><li>Redistribute entries evenly, copy up middle key. </li></ul></ul></ul><ul><ul><ul><li>Insert index entry pointing to L2 into parent of L. </li></ul></ul></ul><ul><li>This can happen recursively </li></ul><ul><ul><li>To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.) </li></ul></ul><ul><li>Splits “grow” tree; root split increases height. </li></ul><ul><ul><li>Tree growth: gets wider or one level taller at top . </li></ul></ul>
    31. 31. Inserting 8* into Example B+ Tree <ul><li>Observe how minimum occupancy is guaranteed in both leaf and index pg splits. </li></ul><ul><li>Recall that all data items are in leaves, and partition values for keys are in intermediate nodes </li></ul><ul><ul><li>Note difference between copy-up and push-up . </li></ul></ul>
    32. 32. Inserting 8* Example: Copy up Root 17 24 30 2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39* 13 Want to insert here; no room, so split & copy up: 2* 3* 5* 7* 8* 5 Entry to be inserted in parent node. (Note that 5 is copied up and continues to appear in the leaf.) 8*
    33. 33. Inserting 8* Example: Push up Root 17 24 30 2* 3* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39* 13 5* 7* 8* 5 Need to split node & push up 5 24 30 17 13 Entry to be inserted in parent node. (Note that 17 is pushed up and only appears once in the index. Contrast this with a leaf split.)
    34. 34. Deleting Data from a B+ Tree <ul><li>Start at root, find leaf L where entry belongs. </li></ul><ul><li>Remove the entry. </li></ul><ul><ul><li>If L is at least half-full, done! </li></ul></ul><ul><ul><li>If L has only d-1 entries, </li></ul></ul><ul><ul><ul><li>Try to re-distribute, borrowing from sibling (adjacent node with same parent as L). </li></ul></ul></ul><ul><ul><ul><li>If re-distribution fails, merge L and sibling. </li></ul></ul></ul><ul><li>If merge occurred, must delete entry (pointing to L or sibling) from parent of L. </li></ul><ul><li>Merge could propagate to root, decreasing height. </li></ul>
    35. 35. B+ Tree Summary <ul><li>B+ tree and other indices ideal for range searches, good for equality searches. </li></ul><ul><ul><li>Inserts/deletes leave tree height-balanced; log F N cost. </li></ul></ul><ul><ul><li>High fanout (F) means depth rarely more than 3 or 4. </li></ul></ul><ul><ul><li>Almost always better than maintaining a sorted file. </li></ul></ul><ul><ul><li>Typically, 67% occupancy on average. </li></ul></ul><ul><ul><li>Note: Order (d) concept replaced by physical space criterion in practice (“at least half-full”). </li></ul></ul><ul><ul><ul><li>Records may be variable sized </li></ul></ul></ul><ul><ul><ul><li>Index pages typically hold more entries than leaves </li></ul></ul></ul>
    36. 36. Other Kinds of Indices <ul><li>Multidimensional indices </li></ul><ul><ul><li>R-trees, kD-trees, … </li></ul></ul><ul><li>Text indices </li></ul><ul><ul><li>Inverted indices </li></ul></ul><ul><li>Structural indices </li></ul><ul><ul><li>Object indices: access support relations, path indices </li></ul></ul><ul><ul><li>XML and graph indices: dataguides, 1-indices, d(k) indices </li></ul></ul><ul><ul><ul><li>Describe parent-child, path relationships </li></ul></ul></ul>
    37. 37. Speeding Operations over Data <ul><li>Three general data organization techniques: </li></ul><ul><ul><li>Indexing </li></ul></ul><ul><ul><li>Sorting </li></ul></ul><ul><ul><li>Hashing </li></ul></ul>
    38. 38. Technique II: Sorting <ul><li>Pass 1: Read a page, sort it, write it. </li></ul><ul><ul><li>only one buffer page is used </li></ul></ul><ul><li>Pass 2, 3, …, etc.: </li></ul><ul><ul><li>three buffer pages used. </li></ul></ul>Main memory buffers INPUT 1 INPUT 2 OUTPUT Disk Disk
    39. 39. Two-Way External Merge Sort <ul><li>Each pass we read, write each page in file. </li></ul><ul><li>N pages in the file => the number of passes </li></ul><ul><li>Total cost is: </li></ul><ul><li>Idea: Divide and conquer: sort subfiles and merge </li></ul>Input file 1-page runs 2-page runs 4-page runs 8-page runs PASS 0 PASS 1 PASS 2 PASS 3 9 3,4 6,2 9,4 8,7 5,6 3,1 2 3,4 5,6 2,6 4,9 7,8 1,3 2 2,3 4,6 4,7 8,9 1,3 5,6 2 2,3 4,4 6,7 8,9 1,2 3,5 6 1,2 2,3 3,4 4,5 6,6 7,8
    40. 40. General External Merge Sort <ul><li>To sort a file with N pages using B buffer pages: </li></ul><ul><ul><li>Pass 0: use B buffer pages. Produce sorted runs of B pages each. </li></ul></ul><ul><ul><li>Pass 2, …, etc.: merge B-1 runs. </li></ul></ul>B Main memory buffers INPUT 1 INPUT B-1 OUTPUT Disk Disk INPUT 2 . . . . . . . . . <ul><li>How can we utilize more than 3 buffer pages? </li></ul>
    41. 41. Cost of External Merge Sort <ul><li>Number of passes: </li></ul><ul><li>Cost = 2N * (# of passes) </li></ul><ul><li>With 5 buffer pages, to sort 108 page file: </li></ul><ul><ul><li>Pass 0: = 22 sorted runs of 5 pages each (last run is only 3 pages) </li></ul></ul><ul><ul><li>Pass 1: = 6 sorted runs of 20 pages each (last run is only 8 pages) </li></ul></ul><ul><ul><li>Pass 2: 2 sorted runs, 80 pages and 28 pages </li></ul></ul><ul><ul><li>Pass 3: Sorted file of 108 pages </li></ul></ul>
    42. 42. Speeding Operations over Data <ul><li>Three general data organization techniques: </li></ul><ul><ul><li>Indexing </li></ul></ul><ul><ul><li>Sorting </li></ul></ul><ul><ul><li>Hashing </li></ul></ul>
    43. 43. Technique 3: Hashing <ul><li>A familiar idea: </li></ul><ul><ul><li>Requires “good” hash function (may depend on data) </li></ul></ul><ul><ul><li>Distribute data across buckets </li></ul></ul><ul><ul><li>Often multiple items in same bucket (buckets might overflow) </li></ul></ul><ul><li>Types of hash tables: </li></ul><ul><ul><li>Static </li></ul></ul><ul><ul><li>Extendible (requires directory to buckets; can split) </li></ul></ul><ul><ul><li>Linear (two levels, rotate through + split; bad with skew) </li></ul></ul><ul><ul><li>Can be the basis of disk-based indices! </li></ul></ul><ul><ul><ul><li>We won’t get into detail because of time, but see text </li></ul></ul></ul>

    ×