Presentation Transcript

  • Data Integration and Physical Storage Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November 15, 2005
  • Mappings between Schemas
    • LSD provides attribute correspondences, but not complete mappings
    • Mappings are generally posed as views: they define relations in one schema (typically either the mediated schema or the source schema), given data in the other schema
      • This allows us to “restructure” or “recompose + decompose” our data in a new way
    • We can also define mappings between values in a view
      • We use an intermediate table defining correspondences – a “concordance table”
      • It can be filled in using some type of code, and corrected by hand
  • A Few Mapping Examples
    • Movie(Title, Year, Director, Editor, Star1, Star2)
    • PieceOfArt(ID, Artist, Subject, Title, TypeOfArt)
      • PieceOfArt(I, A, S, T, “Movie”) :- Movie(T, Y, A, _, S1, S2), I = T || Y, S = S1 || S2
    • MotionPicture(ID, Title, Year), Participant(ID, Name, Role)
      • Movie(T, Y, D, E, S1, S2) :- MotionPicture(I, T, Y), Participant(I, D, “Dir”), Participant(I, E, “Editor”), Participant(I, S1, “Star1”), Participant(I, S2, “Star2”)
    • Need a concordance table from CustIDs to PennIDs:
      • T1(CustName, CustID): (“Smith, J.”, 1234)
      • T2(EmpName, PennID): (“John Smith”, 46732)
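The two mapping rules can be read as view definitions evaluated over tuples. The following is a minimal Python sketch, assuming in-memory lists of tuples and taking `||` to mean string concatenation. The function names and all sample rows are hypothetical; only the single concordance pair comes from the slide.

```python
def piece_of_art(movies):
    """PieceOfArt(I, A, S, T, "Movie") :- Movie(T, Y, A, _, S1, S2), I = T || Y, S = S1 || S2"""
    return [(f"{t}{y}", director, f"{s1}{s2}", t, "Movie")
            for (t, y, director, _editor, s1, s2) in movies]


def movie(motion_pictures, participants):
    """Movie(T, Y, D, E, S1, S2) :- MotionPicture(I, T, Y), Participant(I, D, "Dir"), ..."""
    # Index participants by (ID, role) so each role lookup is a single probe.
    by_role = {}
    for (pid, name, role) in participants:
        by_role.setdefault((pid, role), []).append(name)
    rows = []
    for (i, t, y) in motion_pictures:
        for d in by_role.get((i, "Dir"), []):
            for e in by_role.get((i, "Editor"), []):
                for s1 in by_role.get((i, "Star1"), []):
                    for s2 in by_role.get((i, "Star2"), []):
                        rows.append((t, y, d, e, s1, s2))
    return rows


# Hypothetical sample data, invented for illustration only.
motion_pictures = [(1, "Example Film", 2005)]
participants = [(1, "Dana", "Dir"), (1, "Eli", "Editor"),
                (1, "Sam", "Star1"), (1, "Tess", "Star2")]

movies = movie(motion_pictures, participants)
art = piece_of_art(movies)

# Concordance table: explicit value correspondences from CustID to PennID,
# holding the one pair shown on the slide.
concordance = {1234: 46732}
```

The concordance table is just a lookup relation joined in wherever a CustID has to be translated to a PennID; it can be populated by a matching program and then corrected by hand.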
  • Two Important Approaches
    • TSIMMIS [Garcia-Molina+97] – Stanford
      • Focus: semistructured data (OEM), OQL-based language (Lorel)
      • Creates a mediated schema as a view over the sources
      • Spawned a UCSD project called MIX, which led to a company now owned by BEA Systems
      • Other important systems of this vein: Kleisli/K2 @ Penn
    • Information Manifold [Levy+96] – AT&T Research
      • Focus: local-as-view mappings, relational model
      • Sources defined as views over mediated schema
        • Requires a special
      • Led to peer-to-peer integration approaches (Piazza, etc.)
  • TSIMMIS
    • Focus: Web-based queryable sources
    • One of the first systems to support semi-structured data, which predated XML by several years: “OEM”
    • An instance of a “global-as-view” mediation system
      • We define our global schema as views over the sources
  • XML vs. Object Exchange Model
    • XML:
      <book>
        <author>Bernstein</author>
        <author>Newcomer</author>
        <title>Principles of TP</title>
      </book>
      <book>
        <author>Chamberlin</author>
        <title>DB2 UDB</title>
      </book>
    • OEM:
      O1: book {
        O2: author { Bernstein }
        O3: author { Newcomer }
        O4: title { Principles of TP }
      }
      O5: book {
        O6: author { Chamberlin }
        O7: title { DB2 UDB }
      }
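The correspondence between the two notations can be sketched in Python: each XML element becomes an OEM object with an object ID, a label, and either an atomic value or a set of child objects. This is an illustration only; the `<books>` wrapper element is an assumption added so the fragment parses as a single document, and the tuple encoding of OEM objects is hypothetical.

```python
import itertools
import xml.etree.ElementTree as ET

XML = """<books>
<book><author>Bernstein</author><author>Newcomer</author><title>Principles of TP</title></book>
<book><author>Chamberlin</author><title>DB2 UDB</title></book>
</books>"""


def to_oem(elem, ids):
    """Return (oid, label, value): value is a string for leaves, else a list of child objects."""
    oid = f"O{next(ids)}"          # the parent gets its ID before its children
    children = list(elem)
    if not children:
        return (oid, elem.tag, (elem.text or "").strip())
    return (oid, elem.tag, [to_oem(child, ids) for child in children])


ids = itertools.count(1)
oem = [to_oem(book, ids) for book in ET.fromstring(XML)]
```

Numbering objects in document order reproduces the IDs O1 through O7 used on the slide.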
  • Queries in TSIMMIS
    • Specified in OQL-style language called Lorel
      • OQL was an object-oriented query language that looks like SQL
      • Lorel is, in many ways, a predecessor to XQuery
    • Based on path expressions over OEM structures:
    • select book where book.title = “DB2 UDB” and book.author = “Chamberlin”
    • This is basically like XQuery, which we’ll use in place of Lorel and the MSL template language. The previous query, restated:
      • for $b in AllData()/book where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin” return $b
  • Query Answering in TSIMMIS
    • Basically, it’s view unfolding, i.e., composing a query with a view
      • The query is the one being asked
      • The views are the MSL templates for the wrappers
      • Some of the views may actually require parameters, e.g., an author name, before they’ll return answers
        • Common for web forms (see Amazon, Google, …)
        • XQuery functions (XQuery’s version of views) support parameters as well, so we’ll see these in action
  • A Wrapper Definition in MSL
    • Wrappers have templates and binding patterns ($X) in MSL:
    • B :- B: <book {<author $X>}> // $$ = “select * from book where author=“ $X //
      • This reformats a SQL query over Book(author, year, title)
    • In XQuery, this might look like:
      • define function GetBook($x AS xsd:string) as book { for $b in sql(“Amazon.DB”, “select * from book where author=‘” + $x + “’”) return <book>{$b/title}<author>{$x}</author></book> }
    • The results of GetBook are unioned with those of the other wrappers to form the view Mediator()
  • How to Answer the Query
    • Given our query:
      • for $b in Mediator()/book where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin” return $b
    • Find all wrapper definitions that:
      • Output enough “structure” to match the conditions of the query
      • Or have already tested the conditions for us!
  • Query Composition with Views
    • We find all views that define book with author and title, and we compose the query with each:
      • define function GetBook($x AS xsd:string) as book { for $b in sql(“Amazon.DB”, “select * from book where author=‘” + $x + “’”) return <book>{$b/title}<author>{$x}</author></book> }
      • for $b in Mediator()/book where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin” return $b
  • Matching View Output to Our Query’s Conditions
    • Determine that $b/book/author/text() corresponds to $x by matching the pattern on the function’s output:
      • define function GetBook($x AS xsd:string) as book { for $b in sql(“Amazon.DB”, “select * from book where author=‘” + $x + “’”) return <book>{$b/title}<author>{$x}</author></book> }
      • let $x := “Chamberlin” for $b in GetBook($x)/book where $b/title/text() = “DB2 UDB” return $b
  • The Final Step: Unfolding
      • let $x := “Chamberlin” for $b in ( for $b’ in sql(“Amazon.com”, “select * from book where author=‘” + $x + “’”) return <book>{$b’/title}<author>{$x}</author></book> )/book where $b/title/text() = “DB2 UDB” return $b
    • How do we simplify further to get to here?
      • for $b in sql(“Amazon.com”, “select * from book where author=‘Chamberlin’”) where $b/title/text() = “DB2 UDB” return $b
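The unfolding steps can be mimicked in Python to check that the composed and the unfolded query return the same answers. This is a sketch under stated assumptions: an in-memory SQLite table stands in for the wrapped source, `get_book` plays the role of GetBook, and the rows (including the years) are hypothetical.

```python
import sqlite3

# Stand-in for the wrapped source Book(author, year, title); rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (author TEXT, year INTEGER, title TEXT)")
conn.executemany("INSERT INTO book VALUES (?, ?, ?)", [
    ("Chamberlin", 1998, "DB2 UDB"),
    ("Chamberlin", 1996, "Another Title"),
    ("Bernstein", 1997, "Principles of TP"),
])


def get_book(author):
    """The wrapper view: the author must be bound before it returns answers."""
    rows = conn.execute("SELECT title FROM book WHERE author = ?", (author,))
    return [{"title": title, "author": author} for (title,) in rows]


def query_over_view():
    """Query composed with the view: bind x, call the view, then filter on title."""
    x = "Chamberlin"
    return [b for b in get_book(x) if b["title"] == "DB2 UDB"]


def query_unfolded():
    """After unfolding: the view body is inlined and the binding becomes a constant."""
    rows = conn.execute("SELECT title FROM book WHERE author = 'Chamberlin'")
    return [{"title": t, "author": "Chamberlin"} for (t,) in rows if t == "DB2 UDB"]
```

The sketch passes the author as a bound parameter rather than concatenating it into the SQL string as the slides do; the result is the same, and it avoids quoting problems.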
  • Virtues of TSIMMIS
    • Early adopter of semistructured data, greatly predating XML
      • Can support data from many different kinds of sources
      • Obviously, doesn’t fully solve heterogeneity problem
    • Presents a mediated schema that is the union of multiple views
      • Query answering based on view unfolding
    • Easily composed in a hierarchy of mediators
  • Limitations of TSIMMIS’ Approach
    • Some data sources may contain data with certain ranges or properties
      • “Books by Aho”, “Students at UPenn”, …
      • If we ask a query for students at Columbia, don’t want to bother querying students at Penn…
      • How do we express these?
    • Mediated schema is basically the union of the various MSL templates – as they change, so may the mediated schema
  • An Alternate Approach: The Information Manifold (Levy et al.)
    • When you integrate something, you have some conceptual model of the integrated domain
      • Define that as a basic frame of reference, everything else as a view over it
      • “Local as View”
    • May have overlapping/incomplete sources
      • Define each source as the subset of a query over the mediated schema
      • We can use selection or join predicates to specify that a source contains a range of values:
        • ComputerBooks(…) ⊆ Books(Title, …, Subj), Subj = “Computers”
  • The Local-as-View Model
    • The basic model is the following:
      • “Local” sources are views over the mediated schema
      • Sources have the data – mediated schema is virtual
      • Sources may not have all the data from the domain – “open-world assumption”
    • The system must use the sources (views) to answer queries over the mediated schema
  • Query Answering
    • Assumption: conjunctive queries, set semantics
    • Suppose we have a mediated schema: author(aID, isbn, year), book(isbn, title, publisher)
    • Suppose we have the query:
    • q(a, t) :- author(a, i, _), book(i, t, p), t = “DB2 UDB”
    • and sources:
          • s1(a, t) ⊆ author(a, i, _), book(i, t, p), t = “123”
          • s5(a, t, p) ⊆ author(a, i, _), book(i, t, p), p = “SAMS”
    • We want to compose the query with the source mappings – but they’re in the wrong direction!
    • Yet: every tuple in s1, s5 that satisfies the query’s conditions is an answer to the query!
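A small Python sketch can show why answers obtained through a source are sound but possibly incomplete under the open-world assumption. Everything here is hypothetical: the mediated-schema instance is materialized only so the two results can be compared (a real local-as-view system never has it), and the names and rows are invented.

```python
# Hypothetical instance of the virtual mediated schema.
author = [("Chamberlin", "isbn1", 1998), ("Coauthor", "isbn3", 1998), ("Ives", "isbn2", 2005)]
book = [("isbn1", "DB2 UDB", "SAMS"), ("isbn3", "DB2 UDB", "Other Press"),
        ("isbn2", "Some Notes", "Penn")]


def joined():
    """author(a, i, _), book(i, t, p) projected to (a, t, p)."""
    return {(a, t, p) for (a, i, _year) in author for (i2, t, p) in book if i == i2}


def q():
    """q(a, t) :- author(a, i, _), book(i, t, p), t = "DB2 UDB", over the full data."""
    return {(a, t) for (a, t, p) in joined() if t == "DB2 UDB"}


# s5(a, t, p) is contained in the view with p = "SAMS"; it knows nothing about other publishers.
s5 = {(a, t, p) for (a, t, p) in joined() if p == "SAMS"}


def q_using_s5():
    """Rewriting over the source: q'(a, t) :- s5(a, t, p), t = "DB2 UDB"."""
    return {(a, t) for (a, t, p) in s5 if t == "DB2 UDB"}
```

Here `q_using_s5()` is a strict subset of `q()`: the source contributes only correct answers, but misses the book from the other publisher.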
  • Answering Queries Using Views
    • Numerous recently-developed algorithms for these
      • Inverse rules [Duschka et al.]
      • Bucket algorithm [Levy et al.]
      • MiniCon [Pottinger & Halevy]
      • Also related: “chase and backchase” [Popa, Tannen, Deutsch]
    • Requires conjunctive queries
  • Summary of Data Integration
    • Local-as-view integration has replaced global-as-view as the standard
      • More robust way of defining mediated schemas and sources
      • Mediated schema is clearly defined, less likely to change
      • Sources can be more accurately described
    • Methods exist for query reformulation, including inverse rules
    • Integration requires standardization on a single schema
      • Can be hard to get consensus
      • Today we have peer-to-peer data integration, e.g., Piazza [Halevy et al.], Orchestra [Ives et al.], Hyperion [Miller et al.]
    • Some other aspects of integration were addressed in related papers
      • Overlap between sources; coverage of data at sources
      • Semi-automated creation of mappings and wrappers
    • Data integration capabilities in commercial products: BEA’s Liquid Data, IBM’s DB2 Information Integrator, numerous packages from middleware companies
  • Performance: What Governs It?
    • Speed of the machine – of course!
    • But also many software-controlled factors that we must understand:
      • Caching and buffer management
      • How the data is stored – physical layout, partitioning
      • Auxiliary structures – indices
      • Locking and concurrency control (we’ll talk about this later)
      • Different algorithms for operations – query execution
      • Different orderings for execution – query optimization
      • Reuse of materialized views, merging of query subexpressions – answering queries using views; multi-query optimization
  • Our General Emphasis
    • Goal: cover basic principles that are applied throughout database system design
    • Use the appropriate strategy in the appropriate place
      • Every (reasonable) algorithm is good somewhere
    • … And a corollary: database people reinvent a lot of things and add minor tweaks…
  • What’s the “Base” in “Database”?
    • Could just be a file with random access
      • What are the advantages and disadvantages?
    • DBs generally require “raw” disk access
      • Need to know when a page is actually written to disk, vs. queued by the OS
      • Predictable performance, less fragmentation
      • May want to exploit striping or contiguous regions
      • Typically divided into “extents” and pages
  • Buffer Management
    • Could keep DB in RAM
      • “Main-memory DBs” like TimesTen
    • But many DBs are still too big; we read & replace pages
      • May need to force to disk or pin in buffer
    • Policies for page replacement, prefetching
      • LRU, as in Operating Systems (not as good as you might think – why not?)
      • MRU (one-time sequential scans)
      • Clock, etc.
      • DBMIN (min # pages, local policy)
    [Figure: Buffer Mgr handling tuple reads/writes]
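One way to explore the "why not?" about LRU is to simulate a buffer pool on repeated sequential scans of a file slightly larger than the pool. The sketch below is a simplified model, not any particular DBMS's buffer manager: it ignores pinning and dirty pages, and the pool size and trace are hypothetical.

```python
from collections import OrderedDict


def count_misses(trace, capacity, policy):
    """Count page faults for a page-reference trace under 'LRU' or 'MRU' replacement."""
    pool = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for page in trace:
        if page in pool:
            pool.move_to_end(page)      # a hit makes the page most recently used
            continue
        misses += 1
        if len(pool) >= capacity:
            # LRU evicts the oldest entry; MRU evicts the newest one.
            pool.popitem(last=(policy == "MRU"))
        pool[page] = None
    return misses


# Three sequential scans of a 4-page file through a 3-page buffer pool.
trace = [0, 1, 2, 3] * 3
lru_misses = count_misses(trace, capacity=3, policy="LRU")
mru_misses = count_misses(trace, capacity=3, policy="MRU")
```

Under LRU every reference misses, because each page is evicted just before the scan returns to it (sequential flooding). MRU keeps most of the file resident and misses far less often on this trace.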
  • Storing Tuples in Pages
    • Tuples
      • Many possible layouts
        • Dynamic vs. fixed lengths
        • Ptrs, lengths vs. slots
      • Tuples grow down, directories grow up
      • Identity and relocation
    • Objects and XML are harder
      • Horizontal, path, vertical partitioning
      • Generally no algorithmic way of deciding
    • Generally want to leave some space for insertions
    [Figure: a page holding tuples t1, t2, t3]
  • Alternatives for Organizing Files
    • Many alternatives, each ideal for some situation, and poor for others:
      • Heap files: for full file scans or frequent updates
        • Data unordered
        • Write new data at end
      • Sorted Files: if retrieved in sort order or want range
        • Need external sort or an index to keep sorted
      • Hashed Files: if selection on equality
        • Collection of buckets with primary & overflow pages
        • Hashing function over search key attributes
  • Model for Analyzing Access Costs
    • We ignore CPU costs, for simplicity:
      • p(T): The number of data pages in table T
      • r(T): Number of records in table T
      • D: (Average) time to read or write disk page
      • Measuring number of page I/O’s ignores gains of pre-fetching blocks of pages; thus, I/O cost is only approximated.
      • Average-case analysis; based on several simplistic assumptions.
      • Good enough to show the overall trends!
  • Assumptions in Our Analysis
    • Single record insert and delete
    • Heap files:
      • Equality selection on key; exactly one match
      • Insert always at end of file
    • Sorted files:
      • Files compacted after deletions
      • Selections on sort field(s)
    • Hashed files:
      • No overflow buckets, 80% page occupancy
  • Cost of Operations
    • Several assumptions underlie these (rough) estimates!
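    • Estimated cost by file organization, listed as Heap File | Sorted File | Hashed File (“Search” is the equality-search cost for that organization):
      • Scan all recs: p(T) D | p(T) D | 1.25 p(T) D
      • Equality Search: p(T) D / 2 | D log2 p(T) | D
      • Range Search: p(T) D | D (log2 p(T) + # pages with matches) | 1.25 p(T) D
      • Insert: 2D | Search + p(T) D | 2D
      • Delete: Search + D | Search + p(T) D | 2D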
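The cost estimates can be turned into a small calculator to see the trends for a concrete table size. This is a sketch under the same simplifying assumptions as the analysis; it reads "Search" as the equality-search cost of the organization in question, and the function name and the `match_pages` parameter are hypothetical.

```python
import math


def costs(p, D, match_pages=1):
    """Approximate I/O cost per operation for heap, sorted, and hashed files.

    p: number of data pages p(T); D: time per page I/O;
    match_pages: pages holding the matches of a range search.
    """
    sorted_search = D * math.log2(p)   # binary search over the pages
    heap_search = p * D / 2            # on average, half the file is scanned
    return {
        "scan":     {"heap": p * D, "sorted": p * D, "hashed": 1.25 * p * D},
        "equality": {"heap": heap_search, "sorted": sorted_search, "hashed": D},
        "range":    {"heap": p * D,
                     "sorted": D * (math.log2(p) + match_pages),
                     "hashed": 1.25 * p * D},
        "insert":   {"heap": 2 * D, "sorted": sorted_search + p * D, "hashed": 2 * D},
        "delete":   {"heap": heap_search + D, "sorted": sorted_search + p * D, "hashed": 2 * D},
    }


# Example: a 1024-page table, measuring cost in units of one page I/O.
c = costs(p=1024, D=1)
```

With these inputs an equality search costs 512 page I/Os on the heap file, 10 on the sorted file, and 1 on the hashed file, while the hashed file pays 25% more for a full scan because of its 80% occupancy.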
  • Speeding Operations over Data
    • Three general data organization techniques:
      • Indexing
      • Sorting
      • Hashing
  • Technique I: Indexing
    • An index on a file speeds up selections on the search key attributes for the index (trade space for speed).
      • Any subset of the fields of a relation can be the search key for an index on the relation.
      • Search key is not the same as key (minimal set of fields that uniquely identify a record in a relation).
    • An index contains a collection of data entries, and supports efficient retrieval of all data entries k* with a given key value k.
  • Alternatives for Data Entry k* in Index
    • Three alternatives:
      • Data record with key value k
        • Clustered ⇒ fast lookup
        • Index is large; only 1 can exist
      • <k, rid of data record with search key value k>, OR
      • <k, list of rids of data records with search key k>
        • Can have secondary indices
        • Smaller index may mean faster lookup
        • Often not clustered ⇒ more expensive to use
    • Choice of alternative for data entries is orthogonal to the indexing technique used to locate data entries with a given key value k.
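Alternatives 2 and 3 differ only in how the data entries are shaped, which a few lines of Python can make concrete. The heap file, its records, and the choice of age as search key are hypothetical.

```python
# Hypothetical heap file: rid -> record (name, age); the search key is age.
heap_file = {0: ("Ann", 30), 1: ("Bob", 25), 2: ("Cid", 30)}

# Alternative 2: one <k, rid> data entry per record, kept sorted by k.
alt2 = sorted((age, rid) for rid, (_name, age) in heap_file.items())

# Alternative 3: one <k, list of rids> data entry per distinct key value.
alt3 = {}
for age, rid in alt2:
    alt3.setdefault(age, []).append(rid)


def lookup(k):
    """Follow the rids in the matching data entry back to the data records."""
    return [heap_file[rid] for rid in alt3.get(k, [])]
```

Either form stores only keys and rids, so several such indices can coexist on one file, whereas Alternative 1 stores the records themselves and can exist only once.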
  • Classes of Indices
    • Primary vs. secondary: a primary index is on the primary key
    • Clustered vs. unclustered: clustered if the order of records and of index entries is approximately the same
      • Alternative 1 implies clustered, but not vice-versa
      • A file can be clustered on at most one search key
    • Dense vs. sparse: dense has an index entry per data value; sparse may “skip” some
      • Alternative 1 always leads to dense index
      • Every sparse index is clustered!
      • Sparse indexes are smaller; however, some useful optimizations are based on dense indexes
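A sparse index works only because the file is ordered on the search key, which is why every sparse index is clustered. A minimal Python sketch, assuming a hypothetical sorted file of three pages and one index entry per page:

```python
from bisect import bisect_right

# Hypothetical sorted file: each page holds records ordered by key.
pages = [[2, 3, 5], [7, 14, 16], [19, 20, 22]]

# Sparse index: one entry per page (the first key on that page).
sparse = [page[0] for page in pages]


def find(key):
    """Use the sparse index to pick the only page that could hold key, then scan it."""
    pos = bisect_right(sparse, key) - 1   # last page whose first key is <= key
    return pos >= 0 and key in pages[pos]
```

A dense index would instead hold an entry for each of the nine keys, which is larger but lets some queries be answered from the index alone.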
  • Clustered vs. Unclustered Index
    • Suppose index Alternative (2) is used and records are stored in a heap file
      • Perhaps initially sort data file, leave some gaps
      • Inserts may require overflow pages
    [Figure: CLUSTERED vs. UNCLUSTERED index. In both, index entries (index file) direct the search for data entries, which point to data records (data file).]
  • B+ Tree: The DB World’s Favorite Index
    • Insert/delete at log_F N cost
      • (F = fanout, N = # leaf pages)
      • Keep tree height-balanced
    • Minimum 50% occupancy (except for root).
    • Each node contains m entries, where d <= m <= 2d; d is called the order of the tree.
    • Supports equality and range searches efficiently.
    [Figure: index entries (direct search) above the data entries (“sequence set”)]
  • Example B+ Tree
    • Search begins at root, and key comparisons direct it to a leaf.
    • Search for 5*, 15*, all data entries >= 24* ...
    • Based on the search for 15*, we know it is not in the tree!
    [Figure: root 13 | 17 | 24 | 30; leaves [2* 3* 5* 7*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]]
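The three searches on this slide can be traced with a short Python sketch over the example tree. It is a simplification: the tree is hard-coded as one root and five leaves rather than a general node structure, and the function names are hypothetical.

```python
from bisect import bisect_right

# The example tree: one root node and five leaves, left to right.
root_keys = [13, 17, 24, 30]
leaves = [[2, 3, 5, 7], [14, 16], [19, 20, 22], [24, 27, 29], [33, 34, 38, 39]]


def find_leaf(key):
    """Key comparisons at the root pick the child: < 13 goes leftmost, >= 30 rightmost."""
    return bisect_right(root_keys, key)


def contains(key):
    """Equality search: only the one leaf chosen by the root needs to be read."""
    return key in leaves[find_leaf(key)]


def range_from(low):
    """All data entries >= low: find the first leaf, then follow the leaf sequence."""
    start = find_leaf(low)
    return [k for leaf in leaves[start:] for k in leaf if k >= low]
```

Searching for 15 lands in the leaf holding 14* and 16*; because that is the only leaf where 15* could be, its absence there proves it is not in the tree.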
  • B+ Trees in Practice
    • Typical order: 100. Typical fill-factor: 67%.
      • average fanout = 133
    • Typical capacities:
      • Height 4: 133^4 = 312,900,721 records
      • Height 3: 133^3 = 2,352,637 records
    • Can often hold top levels in buffer pool:
      • Level 1 = 1 page = 8 Kbytes
      • Level 2 = 133 pages = 1 Mbyte
      • Level 3 = 17,689 pages = 133 MBytes
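The capacity and buffer-pool figures follow from the fanout alone, as this short Python calculation shows. It assumes the slide's average fanout of 133 and 8 Kbyte pages; the function names are hypothetical.

```python
FANOUT = 133            # average fanout from the slide
PAGE_BYTES = 8 * 1024   # 8 Kbyte pages, as on the slide


def capacity(height):
    """Number of entries reachable through `height` levels of fanout."""
    return FANOUT ** height


def level_pages(level):
    """Number of pages at a given level, counting the root as level 1."""
    return FANOUT ** (level - 1)


for level in (1, 2, 3):
    pages = level_pages(level)
    # Print the level, its page count, and its size in mebibytes.
    print(level, pages, round(pages * PAGE_BYTES / 2**20, 2), "MiB")
```

The sizes printed are in binary megabytes, so they come out close to, but not exactly at, the rounded figures on the slide.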
  • Inserting Data into a B+ Tree
    • Find correct leaf L.
    • Put data entry onto L.
      • If L has enough space, done!
      • Else, must split L (into L and a new node L2)
        • Redistribute entries evenly, copy up middle key.
        • Insert index entry pointing to L2 into parent of L.
    • This can happen recursively
      • To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.)
    • Splits “grow” tree; root split increases height.
      • Tree growth: gets wider or one level taller at top.
  • Inserting 8* into Example B+ Tree
    • Observe how minimum occupancy is guaranteed in both leaf and index pg splits.
    • Recall that all data items are in leaves, and partition values for keys are in intermediate nodes
      • Note difference between copy-up and push-up.
  • Inserting 8* Example: Copy up
    • Tree before the insert: root 13 | 17 | 24 | 30; leaves [2* 3* 5* 7*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]
    • Want to insert 8* into leaf [2* 3* 5* 7*]; no room, so split & copy up: [2* 3*] and [5* 7* 8*]
    • 5 is the entry to be inserted in the parent node. (Note that 5 is copied up and continues to appear in the leaf.)
  • Inserting 8* Example: Push up
    • Leaves after the split: [2* 3*] [5* 7* 8*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]
    • The root would have to hold 5 | 13 | 17 | 24 | 30, so need to split node & push up: [5 13] and [24 30]
    • 17 is the entry to be inserted in the parent node. (Note that 17 is pushed up and only appears once in the index. Contrast this with a leaf split.)
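The difference between the two kinds of split fits in a few lines of Python. This is a minimal sketch on plain key lists, ignoring rids and child pointers; the function names are hypothetical.

```python
def split_leaf(entries):
    """Leaf split: the middle key is COPIED up and also stays in the new right leaf."""
    mid = len(entries) // 2
    left, right = entries[:mid], entries[mid:]
    return left, right, right[0]


def split_index(keys):
    """Index-node split: the middle key is PUSHED up and appears in neither half."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid + 1:], keys[mid]


# The slide's example: inserting 8* overflows the leaf, then the root.
leaf_left, leaf_right, copied_up = split_leaf([2, 3, 5, 7, 8])
index_left, index_right, pushed_up = split_index([5, 13, 17, 24, 30])
```

Both halves of each split keep at least two entries, which is how minimum occupancy is preserved for a tree of order 2.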
  • Deleting Data from a B+ Tree
    • Start at root, find leaf L where entry belongs.
    • Remove the entry.
      • If L is at least half-full, done!
      • If L has only d-1 entries,
        • Try to re-distribute, borrowing from sibling (adjacent node with same parent as L).
        • If re-distribution fails, merge L and sibling.
    • If merge occurred, must delete entry (pointing to L or sibling) from parent of L.
    • Merge could propagate to root, decreasing height.
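The leaf-level part of deletion can be sketched the same way. The version below is simplified and hypothetical: it considers only the right sibling, works on plain key lists, and reports the action instead of updating the parent.

```python
def delete_from_leaf(leaf, sibling, key, d):
    """Delete key from leaf; `sibling` is the right sibling under the same parent.

    Returns (leaf, sibling, action); sibling is None after a merge.
    """
    leaf = [k for k in leaf if k != key]
    if len(leaf) >= d:
        return leaf, sibling, "done"
    if len(sibling) > d:
        # Redistribute: borrow the sibling's smallest entry. The parent's separator
        # key must then be changed to the sibling's new first key.
        return leaf + sibling[:1], sibling[1:], "redistributed"
    # Merge: the parent entry pointing to the sibling must be deleted,
    # which may cause the parent itself to underflow.
    return leaf + sibling, None, "merged"
```

For example, with d = 2, deleting 19 from the leaf [19, 22] next to [24, 27, 29] borrows 24, while deleting 24 from [22, 24] next to [27, 29] forces a merge.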
  • B+ Tree Summary
    • B+ trees and other tree-structured indices are ideal for range searches, and good for equality searches.
      • Inserts/deletes leave tree height-balanced; log_F N cost.
      • High fanout (F) means depth rarely more than 3 or 4.
      • Almost always better than maintaining a sorted file.
      • Typically, 67% occupancy on average.
      • Note: Order (d) concept replaced by physical space criterion in practice (“at least half-full”).
        • Records may be variable sized
        • Index pages typically hold more entries than leaves
  • Other Kinds of Indices
    • Multidimensional indices
      • R-trees, kD-trees, …
    • Text indices
      • Inverted indices
    • Structural indices
      • Object indices: access support relations, path indices
      • XML and graph indices: dataguides, 1-indices, d(k) indices
        • Describe parent-child, path relationships