Data Integration and Physical Storage Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems ...
Mappings between Schemas <ul><li>LSD provides attribute correspondences, but not complete mappings </li></ul><ul><li>Mappi...
A Few Mapping Examples <ul><li>Movie(Title, Year, Director, Editor, Star1, Star2) </li></ul><ul><li>Movie(Title, Year, Dir...
Two Important Approaches <ul><li>TSIMMIS  [Garcia-Molina+97]  – Stanford </li></ul><ul><ul><li>Focus:  semistructured data...
TSIMMIS <ul><li>One of the first systems to support semi-structured data, which predated XML by several years: “OEM” </li>...
XML vs. Object Exchange Model <book>   <author>Bernstein</author>   <author>Newcomer</author>   <title>Principles of TP</t...
Queries in TSIMMIS <ul><li>Specified in OQL-style language called Lorel </li></ul><ul><ul><li>OQL was an object-oriented q...
Query Answering in TSIMMIS <ul><li>Basically, it’s  view unfolding , i.e., composing a query with a view </li></ul><ul><ul...
A Wrapper Definition in MSL <ul><li>Wrappers have templates and binding patterns ($X) in MSL: </li></ul><ul><li>B :- B: <b...
How to Answer the Query <ul><li>Given our query: </li></ul><ul><ul><li>for $b in Mediator()/book where $b/title/text() = “...
Query Composition with Views <ul><li>We find all views that define book with author and title, and we  compose  the query ...
Matching View Output to  Our Query’s Conditions <ul><li>Determine that $b/book/author/text()    $x by matching the patte...
The Final Step:  Unfolding <ul><ul><li>let $x := “Chamberlin” for $b in (  for $b’ in  sql(“Amazon.com”, “select * from bo...
Virtues of TSIMMIS <ul><li>Early adopter of semistructured data, greatly predating XML </li></ul><ul><ul><li>Can support d...
Limitations of TSIMMIS’ Approach <ul><li>Some data sources may contain data with certain ranges or properties </li></ul><u...
An Alternate Approach: The Information Manifold (Levy et al.) <ul><li>When you integrate something, you have some conceptu...
The Local-as-View Model <ul><li>The basic model is the following: </li></ul><ul><ul><li>“Local” sources are views over the...
Query Answering <ul><li>Assumption:  conjunctive queries ,  set semantics </li></ul><ul><li>Suppose we have a mediated sch...
Answering Queries Using Views <ul><li>Numerous recently-developed algorithms for these </li></ul><ul><ul><li>Inverse rules...
Summary of Data Integration <ul><li>Local-as-view integration has replaced global-as-view as the standard </li></ul><ul><u...
Performance:  What Governs It? <ul><li>Speed of the machine – of course! </li></ul><ul><li>But also many software-controll...
Our General Emphasis <ul><li>Goal: cover basic  principles  that are applied throughout database system design </li></ul><...
What’s the “Base” in “Database”? <ul><li>Could just be a file with random access </li></ul><ul><ul><li>What are the advant...
Buffer Management <ul><li>Could keep DB in RAM </li></ul><ul><ul><li>“ Main-memory DBs” like TimesTen </li></ul></ul><ul><...
Storing Tuples in Pages <ul><li>Tuples </li></ul><ul><ul><li>Many possible layouts </li></ul></ul><ul><ul><ul><li>Dynamic ...
Alternatives for Organizing Files <ul><li>Many alternatives,  each ideal for some situation, and poor for others : </li></...
Model for Analyzing Access Costs <ul><li>We ignore CPU costs, for simplicity: </li></ul><ul><ul><li>p(T):   The number of ...
Assumptions in Our Analysis <ul><li>Single record insert and delete </li></ul><ul><li>Heap files: </li></ul><ul><ul><li>Eq...
Cost of Operations <ul><li>Several assumptions underlie these (rough) estimates! </li></ul>2D Search + p(T) D Search + D D...
Speeding Operations over Data <ul><li>Three general data organization techniques: </li></ul><ul><ul><li>Indexing </li></ul...
Technique I: Indexing <ul><li>An  index  on a file speeds up selections on the  search   key   attributes   for the index ...
Alternatives for Data Entry  k*  in Index <ul><li>Three alternatives: </li></ul><ul><ul><li>Data record with key value  k ...
Classes of Indices <ul><li>Primary   vs.   secondary :  primary has primary key </li></ul><ul><li>Clustered   vs.   unclus...
Clustered vs. Unclustered Index <ul><li>Suppose Index Alternative (2) used, records are stored in Heap file </li></ul><ul>...
B+ Tree:  The DB World’s  Favorite Index <ul><li>Insert/delete at log  F  N cost  </li></ul><ul><ul><li>(F = fanout, N = #...
Example B+ Tree <ul><li>Search begins at root, and key comparisons direct it to a leaf. </li></ul><ul><li>Search for 5*, 1...
B+ Trees in Practice <ul><li>Typical order: 100.  Typical fill-factor: 67%. </li></ul><ul><ul><li>average fanout = 133 </l...
Inserting Data into a B+ Tree <ul><li>Find correct leaf L.  </li></ul><ul><li>Put data entry onto L. </li></ul><ul><ul><li...
Inserting 8* into Example B+ Tree <ul><li>Observe how minimum occupancy is guaranteed in both leaf and index pg splits. </...
Inserting 8* Example: Copy up Root 17 24 30 2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39* 13 Want to insert ...
Inserting 8* Example: Push up Root 17 24 30 2* 3* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39* 13 5* 7* 8* 5 Need to sp...
Deleting Data from a B+ Tree <ul><li>Start at root, find leaf L where entry belongs. </li></ul><ul><li>Remove the entry. <...
B+ Tree Summary <ul><li>B+ tree and other indices ideal for range searches, good for equality searches. </li></ul><ul><ul>...
Other Kinds of Indices <ul><li>Multidimensional indices </li></ul><ul><ul><li>R-trees, kD-trees, … </li></ul></ul><ul><li>...
Upcoming SlideShare
Loading in …5
×

11/15

760 views

Published on

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
760
On SlideShare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
4
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide
  • 2
  • 3
  • 4
  • 7
  • 8
  • 11
  • 12
  • 9
  • 10
  • 6
  • 12
  • 14
  • 23
  • 11/15

    1. 1. Data Integration and Physical Storage Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November 15, 2005
    2. 2. Mappings between Schemas <ul><li>LSD provides attribute correspondences, but not complete mappings </li></ul><ul><li>Mappings generally are posed as views : define relations in one schema (typically either the mediated schema or the source schema), given data in the other schema </li></ul><ul><ul><li>This allows us to “restructure” or “recompose + decompose” our data in a new way </li></ul></ul><ul><li>We can also define mappings between values in a view </li></ul><ul><ul><li>We use an intermediate table defining correspondences – a “concordance table” </li></ul></ul><ul><ul><li>It can be filled in using some type of code, and corrected by hand </li></ul></ul>
    3. 3. A Few Mapping Examples <ul><li>Movie(Title, Year, Director, Editor, Star1, Star2) </li></ul><ul><li>Movie(Title, Year, Director, Editor, Star1, Star2) </li></ul><ul><li>PieceOfArt(ID, Artist, Subject, Title, TypeOfArt) </li></ul><ul><li>MotionPicture(ID, Title, Year) Participant(ID, Name, Role) </li></ul>PieceOfArt(I, A, S, T, “Movie”) :- Movie(T, Y, A, _, S1, S2), ID = T || Y, S = S1 || S2 Movie(T, Y, D, E, S1, S2) :- MotionPicture(I, T, Y), Participant(I, D, “Dir”), Participant(I, E, “Editor”), Participant(I, S1, “Star1”), Participant(I, S2, “Star2”) T1 T2 Need a concordance table from CustIDs to PennIDs Smith, J. 1234 CustName CustID John Smith 46732 EmpName PennID
    4. 4. Two Important Approaches <ul><li>TSIMMIS [Garcia-Molina+97] – Stanford </li></ul><ul><ul><li>Focus: semistructured data (OEM), OQL-based language (Lorel) </li></ul></ul><ul><ul><li>Creates a mediated schema as a view over the sources </li></ul></ul><ul><ul><li>Spawned a UCSD project called MIX, which led to a company now owned by BEA Systems </li></ul></ul><ul><ul><li>Other important systems of this vein: Kleisli/K2 @ Penn </li></ul></ul><ul><li>Information Manifold [Levy+96] – AT&T Research </li></ul><ul><ul><li>Focus: local-as-view mappings, relational model </li></ul></ul><ul><ul><li>Sources defined as views over mediated schema </li></ul></ul><ul><ul><ul><li>Requires a special </li></ul></ul></ul><ul><ul><li>Led to peer-to-peer integration approaches (Piazza, etc.) </li></ul></ul><ul><li>Focus: Web-based queriable sources </li></ul>
    5. 5. TSIMMIS <ul><li>One of the first systems to support semi-structured data, which predated XML by several years: “OEM” </li></ul><ul><li>An instance of a “global-as-view” mediation system </li></ul><ul><ul><li>We define our global schema as views over the sources </li></ul></ul>
    6. 6. XML vs. Object Exchange Model <book> <author>Bernstein</author> <author>Newcomer</author> <title>Principles of TP</title> </book> <book> <author>Chamberlin</author> <title>DB2 UDB</title> </book> O1: book { O2: author { Bernstein } O3: author { Newcomer } O4: title { Principles of TP } } O5: book { O6: author { Chamberlin } O7: title { DB2 UDB } }
    7. 7. Queries in TSIMMIS <ul><li>Specified in OQL-style language called Lorel </li></ul><ul><ul><li>OQL was an object-oriented query language that looks like SQL </li></ul></ul><ul><ul><li>Lorel is, in many ways, a predecessor to XQuery </li></ul></ul><ul><li>Based on path expressions over OEM structures: </li></ul><ul><li>select book where book.title = “DB2 UDB” and book.author = “Chamberlin” </li></ul><ul><li>This is basically like XQuery, which we’ll use in place of Lorel and the MSL template language. Previous query restated = </li></ul><ul><ul><li>for $b in AllData()/book where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin” return $b </li></ul></ul>
    8. 8. Query Answering in TSIMMIS <ul><li>Basically, it’s view unfolding , i.e., composing a query with a view </li></ul><ul><ul><li>The query is the one being asked </li></ul></ul><ul><ul><li>The views are the MSL templates for the wrappers </li></ul></ul><ul><ul><li>Some of the views may actually require parameters, e.g., an author name, before they’ll return answers </li></ul></ul><ul><ul><ul><li>Common for web forms (see Amazon, Google, …) </li></ul></ul></ul><ul><ul><ul><li>XQuery functions (XQuery’s version of views) support parameters as well, so we’ll see these in action </li></ul></ul></ul>
    9. 9. A Wrapper Definition in MSL <ul><li>Wrappers have templates and binding patterns ($X) in MSL: </li></ul><ul><li>B :- B: <book {<author $X>}> // $$ = “select * from book where author=“ $X // </li></ul><ul><ul><li>This reformats a SQL query over Book(author, year, title) </li></ul></ul><ul><li>In XQuery, this might look like: </li></ul><ul><ul><li>define function GetBook($x AS xsd:string) as book { for $b in sql(“Amazon.DB”, “select * from book where author=‘” + $x +”’”) return <book>{$b/title}<author>$x</author></book> </li></ul></ul><ul><ul><li>} </li></ul></ul>book title author … … … The union of GetBook’s results is unioned with others to form the view Mediator()
    10. 10. How to Answer the Query <ul><li>Given our query: </li></ul><ul><ul><li>for $b in Mediator()/book where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin” return $b </li></ul></ul><ul><li>Find all wrapper definitions that: </li></ul><ul><ul><li>Contain output enough “structure” to match the conditions of the query </li></ul></ul><ul><ul><li>Or have already tested the conditions for us! </li></ul></ul>
    11. 11. Query Composition with Views <ul><li>We find all views that define book with author and title, and we compose the query with each: </li></ul><ul><ul><li>define function GetBook($x AS xsd:string) as book { for $b in sql(“Amazon.DB”, “select * from book where author=‘” + $x + “’”) return <book> {$b/ title } <author> {$x}</author></book> </li></ul></ul><ul><ul><li>} </li></ul></ul><ul><ul><li>for $b in Mediator()/ book where $b/ title /text() = “DB2 UDB” and $b/ author /text() = “Chamberlin” return $b </li></ul></ul>book title author … …
    12. 12. Matching View Output to Our Query’s Conditions <ul><li>Determine that $b/book/author/text()  $x by matching the pattern on the function’s output: </li></ul><ul><ul><li>define function GetBook( $x AS xsd:string) as book { for $b in sql(“Amazon.DB”, “select * from book where author=‘” + $x + “’”) return <book>{ $b/title } <author>{$x}</author> </book> </li></ul></ul><ul><ul><li>} </li></ul></ul><ul><ul><li>let $x := “Chamberlin” for $b in GetBook($x)/book where $b/title/text() = “DB2 UDB” return $b </li></ul></ul>book title author … …
    13. 13. The Final Step: Unfolding <ul><ul><li>let $x := “Chamberlin” for $b in ( for $b’ in sql(“Amazon.com”, “select * from book where author=‘” + $x + “’”) return <book>{ $b/title } <author>{$x}</author></book> </li></ul></ul><ul><ul><li> )/book where $b/title/text() = “DB2 UDB” return $b </li></ul></ul><ul><li>How do we simplify further to get to here? </li></ul><ul><ul><li>for $b in sql(“Amazon.com”, “select * from book where author=‘Chamberlin’”) where $b/title/text() = “DB2 UDB” return $b </li></ul></ul>
    14. 14. Virtues of TSIMMIS <ul><li>Early adopter of semistructured data, greatly predating XML </li></ul><ul><ul><li>Can support data from many different kinds of sources </li></ul></ul><ul><ul><li>Obviously, doesn’t fully solve heterogeneity problem </li></ul></ul><ul><li>Presents a mediated schema that is the union of multiple views </li></ul><ul><ul><li>Query answering based on view unfolding </li></ul></ul><ul><li>Easily composed in a hierarchy of mediators </li></ul>
    15. 15. Limitations of TSIMMIS’ Approach <ul><li>Some data sources may contain data with certain ranges or properties </li></ul><ul><ul><li>“ Books by Aho”, “Students at UPenn”, … </li></ul></ul><ul><ul><li>If we ask a query for students at Columbia, don’t want to bother querying students at Penn… </li></ul></ul><ul><ul><li>How do we express these? </li></ul></ul><ul><li>Mediated schema is basically the union of the various MSL templates – as they change, so may the mediated schema </li></ul>
    16. 16. An Alternate Approach: The Information Manifold (Levy et al.) <ul><li>When you integrate something, you have some conceptual model of the integrated domain </li></ul><ul><ul><li>Define that as a basic frame of reference, everything else as a view over it </li></ul></ul><ul><ul><li>“ Local as View” </li></ul></ul><ul><li>May have overlapping/incomplete sources </li></ul><ul><ul><li>Define each source as the subset of a query over the mediated schema </li></ul></ul><ul><ul><li>We can use selection or join predicates to specify that a source contains a range of values : </li></ul></ul><ul><ul><ul><li>ComputerBooks(…)  Books(Title, …, Subj), Subj = “Computers” </li></ul></ul></ul>
    17. 17. The Local-as-View Model <ul><li>The basic model is the following: </li></ul><ul><ul><li>“Local” sources are views over the mediated schema </li></ul></ul><ul><ul><li>Sources have the data – mediated schema is virtual </li></ul></ul><ul><ul><li>Sources may not have all the data from the domain – “open-world assumption” </li></ul></ul><ul><li>The system must use the sources (views) to answer queries over the mediated schema </li></ul>
    18. 18. Query Answering <ul><li>Assumption: conjunctive queries , set semantics </li></ul><ul><li>Suppose we have a mediated schema: author(aID, isbn, year), book(isbn, title, publisher) </li></ul><ul><li>Suppose we have the query: </li></ul><ul><li> q(a, t) :- author(a, i, _), book(i, t, p), t = “DB2 UDB” </li></ul><ul><li>and sources: </li></ul><ul><ul><ul><ul><li>s1(a,t)  author(a, i, _), book(i, t, p), t = “123” </li></ul></ul></ul></ul><ul><ul><ul><ul><li>… </li></ul></ul></ul></ul><ul><ul><ul><ul><li>s5(a, t, p)  author(a, i, _), book(i,t), p = “SAMS” </li></ul></ul></ul></ul><ul><li>We want to compose the query with the source mappings – but they’re in the wrong direction! </li></ul><ul><li>Yet: everything in s1, s5 is an answer to the query! </li></ul>
    19. 19. Answering Queries Using Views <ul><li>Numerous recently-developed algorithms for these </li></ul><ul><ul><li>Inverse rules [Duschka et al.] </li></ul></ul><ul><ul><li>Bucket algorithm [Levy et al.] </li></ul></ul><ul><ul><li>MiniCon [Pottinger & Halevy] </li></ul></ul><ul><ul><li>Also related: “chase and backchase” [Popa, Tannen, Deutsch] </li></ul></ul><ul><li>Requires conjunctive queries </li></ul>
    20. 20. Summary of Data Integration <ul><li>Local-as-view integration has replaced global-as-view as the standard </li></ul><ul><ul><li>More robust way of defining mediated schemas and sources </li></ul></ul><ul><ul><li>Mediated schema is clearly defined, less likely to change </li></ul></ul><ul><ul><li>Sources can be more accurately described </li></ul></ul><ul><li>Methods exist for query reformulation, including inverse rules </li></ul><ul><li>Integration requires standardization on a single schema </li></ul><ul><ul><li>Can be hard to get consensus </li></ul></ul><ul><ul><li>Today we have peer-to-peer data integration, e.g., Piazza [Halevy et al.], Orchestra [Ives et al.], Hyperion [Miller et al.] </li></ul></ul><ul><li>Some other aspects of integration were addressed in related papers </li></ul><ul><ul><li>Overlap between sources; coverage of data at sources </li></ul></ul><ul><ul><li>Semi-automated creation of mappings and wrappers </li></ul></ul><ul><li>Data integration capabilities in commercial products: BEA’s Liquid Data, IBM’s DB2 Information Integrator, numerous packages from middleware companies </li></ul>
    21. 21. Performance: What Governs It? <ul><li>Speed of the machine – of course! </li></ul><ul><li>But also many software-controlled factors that we must understand: </li></ul><ul><ul><li>Caching and buffer management </li></ul></ul><ul><ul><li>How the data is stored – physical layout, partitioning </li></ul></ul><ul><ul><li>Auxiliary structures – indices </li></ul></ul><ul><ul><li>Locking and concurrency control (we’ll talk about this later) </li></ul></ul><ul><ul><li>Different algorithms for operations – query execution </li></ul></ul><ul><ul><li>Different orderings for execution – query optimization </li></ul></ul><ul><ul><li>Reuse of materialized views, merging of query subexpressions – answering queries using views; multi-query optimization </li></ul></ul>
    22. 22. Our General Emphasis <ul><li>Goal: cover basic principles that are applied throughout database system design </li></ul><ul><li>Use the appropriate strategy in the appropriate place </li></ul><ul><ul><li>Every (reasonable) algorithm is good somewhere </li></ul></ul><ul><li>… And a corollary: database people reinvent a lot of things and add minor tweaks… </li></ul>
    23. 23. What’s the “Base” in “Database”? <ul><li>Could just be a file with random access </li></ul><ul><ul><li>What are the advantages and disadvantages? </li></ul></ul><ul><li>DBs generally require “raw” disk access </li></ul><ul><ul><li>Need to know when a page is actually written to disk, vs. queued by the OS </li></ul></ul><ul><ul><li>Predictable performance, less fragmentation </li></ul></ul><ul><ul><li>May want to exploit striping or contiguous regions </li></ul></ul><ul><ul><li>Typically divided into “extents” and pages </li></ul></ul>
    24. 24. Buffer Management <ul><li>Could keep DB in RAM </li></ul><ul><ul><li>“ Main-memory DBs” like TimesTen </li></ul></ul><ul><li>But many DBs are still too big; we read & replace pages </li></ul><ul><ul><li>May need to force to disk or pin in buffer </li></ul></ul><ul><li>Policies for page replacement , prefetching </li></ul><ul><ul><li>LRU, as in Operating Systems (not as good as you might think – why not? ) </li></ul></ul><ul><ul><li>MRU (one-time sequential scans) </li></ul></ul><ul><ul><li>Clock, etc. </li></ul></ul><ul><ul><li>DBMIN (min # pages, local policy) </li></ul></ul>Buffer Mgr Tuple Reads/Writes
    25. 25. Storing Tuples in Pages <ul><li>Tuples </li></ul><ul><ul><li>Many possible layouts </li></ul></ul><ul><ul><ul><li>Dynamic vs. fixed lengths </li></ul></ul></ul><ul><ul><ul><li>Ptrs, lengths vs. slots </li></ul></ul></ul><ul><ul><li>Tuples grow down, directories grow up </li></ul></ul><ul><ul><li>Identity and relocation </li></ul></ul><ul><li>Objects and XML are harder </li></ul><ul><ul><li>Horizontal, path, vertical partitioning </li></ul></ul><ul><ul><li>Generally no algorithmic way of deciding </li></ul></ul><ul><li>Generally want to leave some space for insertions </li></ul>t1 t2 t3
    26. 26. Alternatives for Organizing Files <ul><li>Many alternatives, each ideal for some situation, and poor for others : </li></ul><ul><ul><li>Heap files: for full file scans or frequent updates </li></ul></ul><ul><ul><ul><li>Data unordered </li></ul></ul></ul><ul><ul><ul><li>Write new data at end </li></ul></ul></ul><ul><ul><li>Sorted Files: if retrieved in sort order or want range </li></ul></ul><ul><ul><ul><li>Need external sort or an index to keep sorted </li></ul></ul></ul><ul><ul><li>Hashed Files: if selection on equality </li></ul></ul><ul><ul><ul><li>Collection of buckets with primary & overflow pages </li></ul></ul></ul><ul><ul><ul><li>Hashing function over search key attributes </li></ul></ul></ul>
    27. 27. Model for Analyzing Access Costs <ul><li>We ignore CPU costs, for simplicity: </li></ul><ul><ul><li>p(T): The number of data pages in table T </li></ul></ul><ul><ul><li>r(T): Number of records in table T </li></ul></ul><ul><ul><li>D: (Average) time to read or write disk page </li></ul></ul><ul><ul><li>Measuring number of page I/O’s ignores gains of pre-fetching blocks of pages; thus, I/O cost is only approximated. </li></ul></ul><ul><ul><li>Average-case analysis; based on several simplistic assumptions. </li></ul></ul><ul><ul><li>Good enough to show the overall trends! </li></ul></ul>
    28. 28. Assumptions in Our Analysis <ul><li>Single record insert and delete </li></ul><ul><li>Heap files: </li></ul><ul><ul><li>Equality selection on key; exactly one match </li></ul></ul><ul><ul><li>Insert always at end of file </li></ul></ul><ul><li>Sorted files: </li></ul><ul><ul><li>Files compacted after deletions </li></ul></ul><ul><ul><li>Selections on sort field(s) </li></ul></ul><ul><li>Hashed files: </li></ul><ul><ul><li>No overflow buckets, 80% page occupancy </li></ul></ul>
    29. 29. Cost of Operations <ul><li>Several assumptions underlie these (rough) estimates! </li></ul>2D Search + p(T) D Search + D Delete 2D Search + p(T) D 2D Insert 1.25 p(T) D D log 2 p(T) + (# pages with matches) p(T) D Range Search D D log 2 p(T) p(T) D / 2 Equality Search 1.25 p(T) D p(T) D p(T) D Scan all recs Hashed File Sorted File Heap File
    30. 30. Speeding Operations over Data <ul><li>Three general data organization techniques: </li></ul><ul><ul><li>Indexing </li></ul></ul><ul><ul><li>Sorting </li></ul></ul><ul><ul><li>Hashing </li></ul></ul>
    31. 31. Technique I: Indexing <ul><li>An index on a file speeds up selections on the search key attributes for the index (trade space for speed). </li></ul><ul><ul><li>Any subset of the fields of a relation can be the search key for an index on the relation. </li></ul></ul><ul><ul><li>Search key is not the same as key (minimal set of fields that uniquely identify a record in a relation). </li></ul></ul><ul><li>An index contains a collection of data entries , and supports efficient retrieval of all data entries k* with a given key value k . </li></ul>
    32. 32. Alternatives for Data Entry k* in Index <ul><li>Three alternatives: </li></ul><ul><ul><li>Data record with key value k </li></ul></ul><ul><ul><ul><li>Clustered  fast lookup </li></ul></ul></ul><ul><ul><ul><li>Index is large; only 1 can exist </li></ul></ul></ul><ul><ul><li>< k , rid of data record with search key value k > , OR </li></ul></ul><ul><ul><li>< k , list of rids of data records with search key k > </li></ul></ul><ul><ul><ul><li>Can have secondary indices </li></ul></ul></ul><ul><ul><ul><li>Smaller index may mean faster lookup </li></ul></ul></ul><ul><ul><ul><li>Often not clustered  more expensive to use </li></ul></ul></ul><ul><li>Choice of alternative for data entries is orthogonal to the indexing technique used to locate data entries with a given key value k . </li></ul>
    33. 33. Classes of Indices <ul><li>Primary vs. secondary : primary has primary key </li></ul><ul><li>Clustered vs. unclustered : order of records and index approximately same </li></ul><ul><ul><li>Alternative 1 implies clustered, but not vice-versa </li></ul></ul><ul><ul><li>A file can be clustered on at most one search key </li></ul></ul><ul><li>Dense vs. Sparse : dense has index entry per data value; sparse may “skip” some </li></ul><ul><ul><li>Alternative 1 always leads to dense index </li></ul></ul><ul><ul><li>Every sparse index is clustered! </li></ul></ul><ul><ul><li>Sparse indexes are smaller; however, some useful optimizations are based on dense indexes </li></ul></ul>
    34. 34. Clustered vs. Unclustered Index <ul><li>Suppose Index Alternative (2) used, records are stored in Heap file </li></ul><ul><ul><li>Perhaps initially sort data file, leave some gaps </li></ul></ul><ul><ul><li>Inserts may require overflow pages </li></ul></ul>Index entries Data entries direct search for (Index File) (Data file) Data Records data entries Data entries Data Records CLUSTERED UNCLUSTERED
    35. 35. B+ Tree: The DB World’s Favorite Index <ul><li>Insert/delete at log F N cost </li></ul><ul><ul><li>(F = fanout, N = # leaf pages) </li></ul></ul><ul><ul><li>Keep tree height-balanced </li></ul></ul><ul><li>Minimum 50% occupancy (except for root). </li></ul><ul><li>Each node contains d <= m <= 2 d entries. d is called the order of the tree. </li></ul><ul><li>Supports equality and range searches efficiently. </li></ul>Index Entries Data Entries (&quot;Sequence set&quot;) (Direct search)
    36. 36. Example B+ Tree <ul><li>Search begins at root, and key comparisons direct it to a leaf. </li></ul><ul><li>Search for 5*, 15*, all data entries >= 24* ... </li></ul><ul><li>Based on the search for 15*, we know it is not in the tree! </li></ul>Root 17 24 30 2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39* 13
    37. 37. B+ Trees in Practice <ul><li>Typical order: 100. Typical fill-factor: 67%. </li></ul><ul><ul><li>average fanout = 133 </li></ul></ul><ul><li>Typical capacities: </li></ul><ul><ul><li>Height 4: 1334 = 312,900,700 records </li></ul></ul><ul><ul><li>Height 3: 1333 = 2,352,637 records </li></ul></ul><ul><li>Can often hold top levels in buffer pool: </li></ul><ul><ul><li>Level 1 = 1 page = 8 Kbytes </li></ul></ul><ul><ul><li>Level 2 = 133 pages = 1 Mbyte </li></ul></ul><ul><ul><li>Level 3 = 17,689 pages = 133 MBytes </li></ul></ul>
    38. 38. Inserting Data into a B+ Tree <ul><li>Find correct leaf L. </li></ul><ul><li>Put data entry onto L. </li></ul><ul><ul><li>If L has enough space, done! </li></ul></ul><ul><ul><li>Else, must split L (into L and a new node L2) </li></ul></ul><ul><ul><ul><li>Redistribute entries evenly, copy up middle key. </li></ul></ul></ul><ul><ul><ul><li>Insert index entry pointing to L2 into parent of L. </li></ul></ul></ul><ul><li>This can happen recursively </li></ul><ul><ul><li>To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.) </li></ul></ul><ul><li>Splits “grow” tree; root split increases height. </li></ul><ul><ul><li>Tree growth: gets wider or one level taller at top . </li></ul></ul>
    39. 39. Inserting 8* into Example B+ Tree <ul><li>Observe how minimum occupancy is guaranteed in both leaf and index pg splits. </li></ul><ul><li>Recall that all data items are in leaves, and partition values for keys are in intermediate nodes </li></ul><ul><ul><li>Note difference between copy-up and push-up . </li></ul></ul>
    40. 40. Inserting 8* Example: Copy up Root 17 24 30 2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39* 13 Want to insert here; no room, so split & copy up: 2* 3* 5* 7* 8* 5 Entry to be inserted in parent node. (Note that 5 is copied up and continues to appear in the leaf.) 8*
    41. 41. Inserting 8* Example: Push up Root 17 24 30 2* 3* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39* 13 5* 7* 8* 5 Need to split node & push up 5 24 30 17 13 Entry to be inserted in parent node. (Note that 17 is pushed up and only appears once in the index. Contrast this with a leaf split.)
    42. 42. Deleting Data from a B+ Tree <ul><li>Start at root, find leaf L where entry belongs. </li></ul><ul><li>Remove the entry. </li></ul><ul><ul><li>If L is at least half-full, done! </li></ul></ul><ul><ul><li>If L has only d-1 entries, </li></ul></ul><ul><ul><ul><li>Try to re-distribute, borrowing from sibling (adjacent node with same parent as L). </li></ul></ul></ul><ul><ul><ul><li>If re-distribution fails, merge L and sibling. </li></ul></ul></ul><ul><li>If merge occurred, must delete entry (pointing to L or sibling) from parent of L. </li></ul><ul><li>Merge could propagate to root, decreasing height. </li></ul>
    43. 43. B+ Tree Summary <ul><li>B+ tree and other indices ideal for range searches, good for equality searches. </li></ul><ul><ul><li>Inserts/deletes leave tree height-balanced; log F N cost. </li></ul></ul><ul><ul><li>High fanout (F) means depth rarely more than 3 or 4. </li></ul></ul><ul><ul><li>Almost always better than maintaining a sorted file. </li></ul></ul><ul><ul><li>Typically, 67% occupancy on average. </li></ul></ul><ul><ul><li>Note: Order (d) concept replaced by physical space criterion in practice (“at least half-full”). </li></ul></ul><ul><ul><ul><li>Records may be variable sized </li></ul></ul></ul><ul><ul><ul><li>Index pages typically hold more entries than leaves </li></ul></ul></ul>
    44. 44. Other Kinds of Indices <ul><li>Multidimensional indices </li></ul><ul><ul><li>R-trees, kD-trees, … </li></ul></ul><ul><li>Text indices </li></ul><ul><ul><li>Inverted indices </li></ul></ul><ul><li>Structural indices </li></ul><ul><ul><li>Object indices: access support relations, path indices </li></ul></ul><ul><ul><li>XML and graph indices: dataguides, 1-indices, d(k) indices </li></ul></ul><ul><ul><ul><li>Describe parent-child, path relationships </li></ul></ul></ul>

    ×