Transcript

    • 1. Data Integration, Concluded / Physical Data Storage. Zachary G. Ives, University of Pennsylvania. CIS 550 – Database & Information Systems, June 21, 2010
    • 2. An Alternate Approach: The Information Manifold (Levy et al.)
      • When you integrate something, you have some conceptual model of the integrated domain
        • Define that as a basic frame of reference, everything else as a view over it
        • “Local as View”
      • May have overlapping/incomplete sources
        • Define each source as the subset of a query over the mediated schema
        • We can use selection or join predicates to specify that a source contains a range of values:
          • ComputerBooks(…) ⊆ Books(Title, …, Subj), Subj = “Computers”
    • 3. The Local-as-View Model
      • The basic model is the following:
        • “Local” sources are views over the mediated schema
        • Sources have the data – mediated schema is virtual
        • Sources may not have all the data from the domain – “open-world assumption”
      • The system must use the sources (views) to answer queries over the mediated schema
    • 4. Query Answering
      • Assumption: conjunctive queries, set semantics
      • Suppose we have a mediated schema: author(aID, isbn, year), book(isbn, title, publisher)
      • Suppose we have the query:
      • q(a, t) :- author(a, i, _), book(i, t, p), t = “DB2 UDB”
      • and sources:
            • s1(a, t) ⊆ author(a, i, _), book(i, t, p), t = “DB2 UDB”
            • s5(a, t, p) ⊆ author(a, i, _), book(i, t, p), p = “SAMS”
      • We want to compose the query with the source mappings – but they’re in the wrong direction!
      • Yet: everything in s1, s5 is an answer to the query!
    • 5. Answering Queries Using Views
      • Numerous recently developed algorithms for this:
        • Inverse rules [Duschka et al.]
        • Bucket algorithm [Levy et al.]
        • MiniCon [Pottinger & Halevy]
        • Also related: “chase and backchase” [Popa, Tannen, Deutsch]
      • Requires conjunctive queries
    • 6. Inverse Rules
      • Reverse the implication in each rule:
            • author(a, i1(a,t), _) :- s1(a,t)
            • book(i1(a,t), t, p1(a,t)) :- s1(a,t), t = “DB2 UDB”
            • author(a, i5(a,t,p), _) :- s5(a, t, p)
            • book(i5(a,t,p), t, p) :- s5(a, t, p), p = “SAMS”
      • Then unfold the query q(a, t) :- author(a, i, _), book(i, t, p), t = “DB2 UDB” with each rule that defines author and book:
          • q(a, t) :- s1(a, t), i = i1(a,t), p = p1(a,t), t = “DB2 UDB”
          • q(a, t) :- s5(a, t, p), i = i5(a,t,p), p = “SAMS”, t = “DB2 UDB”
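      • To make the unfolding concrete, here is a minimal Python sketch (illustrative; the names s1, s5, i1, i5, p1 follow the slides, everything else is an assumption) that applies the inverse rules to sample source tuples and then evaluates q over the derived facts:

        # Inverse-rules evaluation sketch; Skolem terms like ("i1", a, t)
        # stand in for the unknown isbn values.
        s1 = [("Ives", "DB2 UDB")]                 # s1(a, t), t = "DB2 UDB"
        s5 = [("Ives", "DB2 UDB", "SAMS")]         # s5(a, t, p), p = "SAMS"
        author, book = set(), set()                # year column omitted for brevity

        for (a, t) in s1:                          # author(a, i1(a,t), _) :- s1(a,t)
            author.add((a, ("i1", a, t)))
            if t == "DB2 UDB":                     # book(i1(a,t), t, p1(a,t)) :- s1(a,t), ...
                book.add((("i1", a, t), t, ("p1", a, t)))

        for (a, t, p) in s5:                       # author(a, i5(a,t,p), _) :- s5(a,t,p)
            author.add((a, ("i5", a, t, p)))
            if p == "SAMS":                        # book(i5(a,t,p), t, p) :- s5(a,t,p), ...
                book.add((("i5", a, t, p), t, p))

        # q(a, t) :- author(a, i, _), book(i, t, p), t = "DB2 UDB"
        q = {(a, t) for (a, i) in author
                    for (i2, t, p2) in book
                    if i == i2 and t == "DB2 UDB"}
        print(q)   # {('Ives', 'DB2 UDB')}: answers obtainable from s1 and s5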
    • 7. Summary of Data Integration
      • Local-as-view integration has replaced global-as-view as the standard
        • More robust way of defining mediated schemas and sources
        • Mediated schema is clearly defined, less likely to change
        • Sources can be more accurately described
      • Methods exist for query reformulation, including inverse rules
      • Integration requires standardization on a single schema
        • Can be hard to get consensus
        • Today we have peer-to-peer data integration, e.g., Piazza [Halevy et al.], Orchestra [Ives et al.], Hyperion [Miller et al.]
      • Some other aspects of integration were addressed in related papers
        • Overlap between sources; coverage of data at sources
        • Semi-automated creation of mappings and wrappers
      • Data integration capabilities in commercial products: Oracle Fusion, IBM’s WebSphere Information Integrator, numerous packages from middleware companies; MS BizTalk Mapper, IBM Rational Data Architect
    • 8. Performance: What Governs It?
      • Speed of the machine – of course!
      • But also many software-controlled factors that we must understand:
        • Caching and buffer management
        • How the data is stored – physical layout, partitioning
        • Auxiliary structures – indices
        • Locking and concurrency control (we’ll talk about this later)
        • Different algorithms for operations – query execution
        • Different orderings for execution – query optimization
        • Reuse of materialized views, merging of query subexpressions – answering queries using views; multi-query optimization
    • 9. Our General Emphasis
      • Goal: cover basic principles that are applied throughout database system design
      • Use the appropriate strategy in the appropriate place
        • Every (reasonable) algorithm is good somewhere
      • … And a corollary: database people reinvent a lot of things and add minor tweaks…
    • 10. Storing Tuples in Pages
      • Tuples
        • Many possible layouts
          • Dynamic vs. fixed lengths
          • Ptrs, lengths vs. slots
        • Tuples grow down, directories grow up
        • Identity and relocation
      • Objects and XML are harder
        • Horizontal, path, vertical partitioning
        • Generally no algorithmic way of deciding
      • Generally want to leave some space for insertions
      [Figure: a slotted page holding tuples t1, t2, t3, with a slot directory]
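      • As a concrete (purely illustrative) sketch of the slotted-page idea, assuming an 8 KB page: the slot directory grows from one end while tuple bytes grow from the other, and a tuple's slot number gives it a stable identity even if it is later relocated within the page.

        PAGE_SIZE = 8192   # assumed page size

        class SlottedPage:
            def __init__(self):
                self.data = bytearray(PAGE_SIZE)
                self.slots = []               # slot directory "grows up"
                self.free_end = PAGE_SIZE     # tuples "grow down" from the end

            def insert(self, tup: bytes) -> int:
                """Store a tuple; return its slot number (its id within the page)."""
                off = self.free_end - len(tup)
                self.data[off:self.free_end] = tup
                self.free_end = off
                self.slots.append((off, len(tup)))   # (offset, length)
                return len(self.slots) - 1

            def read(self, slot: int) -> bytes:
                off, length = self.slots[slot]
                return bytes(self.data[off:off + length])

        page = SlottedPage()
        rid = page.insert(b"t1|Ives|2010")
        print(page.read(rid))   # b't1|Ives|2010'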
    • 11. Alternatives for Organizing Files
      • Many alternatives, each ideal for some situation, and poor for others:
        • Heap files: for full file scans or frequent updates
          • Data unordered
          • Write new data at end
        • Sorted Files: if retrieved in sort order or want range
          • Need external sort or an index to keep sorted
        • Hashed Files: if selection on equality
          • Collection of buckets with primary & overflow pages
          • Hashing function over search key attributes
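      • A toy Python sketch of a static hashed file (illustrative only; the bucket count and page capacity are made-up parameters): each bucket is a primary page plus a chain of overflow pages, so an equality search touches only the one bucket the hash function names.

        N_BUCKETS, PAGE_CAPACITY = 8, 4    # assumed, unrealistically small

        class HashedFile:
            def __init__(self):
                # each bucket: a primary page plus any overflow pages
                self.buckets = [[[]] for _ in range(N_BUCKETS)]

            def insert(self, key, record):
                pages = self.buckets[hash(key) % N_BUCKETS]
                if len(pages[-1]) == PAGE_CAPACITY:
                    pages.append([])          # start an overflow page
                pages[-1].append((key, record))

            def search(self, key):
                pages = self.buckets[hash(key) % N_BUCKETS]
                return [r for page in pages for (k, r) in page if k == key]

        hf = HashedFile()
        hf.insert("DB2 UDB", ("Ives", "SAMS"))
        print(hf.search("DB2 UDB"))   # [('Ives', 'SAMS')]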
    • 12. Model for Analyzing Access Costs
      • We ignore CPU costs, for simplicity:
        • p(T): The number of data pages in table T
        • r(T): Number of records in table T
        • D: (Average) time to read or write disk page
        • Measuring number of page I/O’s ignores gains of pre-fetching blocks of pages; thus, I/O cost is only approximated.
        • Average-case analysis; based on several simplistic assumptions.
        • Good enough to show the overall trends!
    • 13. Approximate Cost of Operations
      • Several assumptions underlie these (rough) estimates!
      • Approximate costs (D = time per page I/O; hashed-file figures assume no overflow buckets and 80% page occupancy):

                           Heap File      Sorted File                               Hashed File
        Scan all recs      p(T) D         p(T) D                                    1.25 p(T) D
        Equality search    p(T) D / 2     D log2 p(T)                               D
        Range search       p(T) D         D log2 p(T) + D (# pages with matches)    1.25 p(T) D
        Insert             2D             Search + p(T) D                           2D
        Delete             Search + D     Search + p(T) D                           2D
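      • The formulas are simple enough to encode directly. A small Python sketch under the same assumptions ("Search" for a sorted file is the D log2 p(T) binary search, for a heap file the p(T) D / 2 average scan):

        from math import log2

        def costs(p, D=1.0, match_pages=1):
            """Approximate I/O costs, mirroring the table above."""
            return {
                "heap":   {"scan": p*D, "equality": p*D/2, "range": p*D,
                           "insert": 2*D, "delete": p*D/2 + D},
                "sorted": {"scan": p*D, "equality": D*log2(p),
                           "range": D*log2(p) + match_pages*D,
                           "insert": D*log2(p) + p*D,
                           "delete": D*log2(p) + p*D},
                "hashed": {"scan": 1.25*p*D, "equality": D, "range": 1.25*p*D,
                           "insert": 2*D, "delete": 2*D},
            }

        for org, ops in costs(p=1000).items():
            print(org, {op: round(c, 1) for op, c in ops.items()})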
    • 14. Speeding Operations over Data
      • Recall that we’re interested in how to get good performance in answering queries
      • The first consideration is how the data is made accessible to the DBMS
        • We saw different arrangements of the tables: Heap (unsorted) files, sorted files, and hashed files
      • Today we look further at 3 core concepts that are used to efficiently support sort- and hash-based access to data:
        • Indexing
        • Sorting
        • Hashing
    • 15. Technique I: Indexing
      • An index on a file speeds up selections on the search key attributes for the index (trade space for speed).
        • Any subset of the fields of a relation can be the search key for an index on the relation.
        • Search key is not the same as key (minimal set of fields that uniquely identify a record in a relation).
      • An index contains a collection of data entries , and supports efficient retrieval of all data entries k* with a given key value k .
      • Generally the entries of an index are some form of node in a tree – but should the index contain the data, or pointers to the data?
    • 16. Alternatives for Data Entry k* in Index
      • Three alternatives for where to put the data:
        • Data record wherever key value k appears
          • Clustered ⇒ fast lookup
          • Index is large; only 1 can exist
        • < k , rid of data record with search key value k > , OR
        • < k , list of rids of data records with search key k >
          • Can have secondary indices
          • Smaller index may mean faster lookup
          • Often not clustered ⇒ more expensive to use
      • Choice of alternative for data entries is orthogonal to the indexing technique used to locate data entries with a given key value k
      rid = row id, conceptually a pointer
    • 17. Classes of Indices
      • Primary vs. secondary: primary has the primary key
        • Most DBMSs automatically generate a primary index when you define a primary key
      • Clustered vs. unclustered: order of records and index are approximately the same
        • Alternative 1 implies clustered, but not vice-versa
        • A file can be clustered on at most one search key
      • Dense vs. sparse: dense has an index entry per data value; sparse may “skip” some
        • Alternative 1 always leads to dense index [Why?]
        • Every sparse index is clustered!
        • Sparse indexes are smaller; however, some useful optimizations are based on dense indexes
    • 18. Clustered vs. Unclustered Index
      • Suppose Index Alternative (2) is used, with pointers to records stored in a heap file
        • Perhaps initially sort data file, leave some gaps
        • Inserts may require overflow pages
      • Consider how these strategies affect disk caching and access
      [Figure: CLUSTERED vs. UNCLUSTERED. In both, index entries in the index file direct the search for data entries; in the clustered case the order of data records in the data file matches the data entries, in the unclustered case it does not.]
    • 19. B+ Tree: The DB World’s Favorite Index
      • Insert/delete at log_F N cost
        • (F = fanout, N = # leaf pages)
        • Keep tree height-balanced
      • Minimum 50% occupancy (except for root).
      • Each node contains d ≤ m ≤ 2d entries; d is called the order of the tree.
      • Supports equality and range searches efficiently.
      [Figure: index entries at the top support direct search; data entries at the leaf level form the “sequence set”]
    • 20. Example B+ Tree
      • Search begins at root, and key comparisons direct it to a leaf.
      • Search for 5*, 15*, all data entries >= 24* ...
      • Based on the search for 15*, we know it is not in the tree!
      [Figure: example B+ tree. Root keys: 13, 17, 24, 30. Leaves: (2* 3* 5* 7*), (14* 16*), (19* 20* 22*), (24* 27* 29*), (33* 34* 38* 39*)]
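      • A minimal Python sketch of that root-to-leaf search (node layout invented for illustration; a real node is a disk page):

        class Node:
            def __init__(self, keys, children=None):
                self.keys = keys          # separator keys (internal) or data-entry keys (leaf)
                self.children = children  # None marks a leaf

        def search(node, key):
            while node.children is not None:      # key comparisons direct us down
                i = 0
                while i < len(node.keys) and key >= node.keys[i]:
                    i += 1
                node = node.children[i]
            return key in node.keys               # final check at the leaf

        # The example tree above: root keys 13, 17, 24, 30.
        root = Node([13, 17, 24, 30],
                    children=[Node([2, 3, 5, 7]), Node([14, 16]),
                              Node([19, 20, 22]), Node([24, 27, 29]),
                              Node([33, 34, 38, 39])])
        print(search(root, 15))   # False: the search for 15* proves it is absent
        print(search(root, 27))   # True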
    • 21. B+ Trees in Practice
      • Typical order: 100. Typical fill-factor: 67%.
        • average fanout = 133
      • Typical capacities:
        • Height 4: 133^4 = 312,900,721 records
        • Height 3: 133^3 = 2,352,637 records
      • Can often hold top levels of tree in buffer pool:
        • Level 1 = 1 page = 8 KB
        • Level 2 = 133 pages = 1 MB
        • Level 3 = 17,689 pages = 133 MB
        • Level 4 = 2,352,637 pages = 18 GB
        • “Nearly O(1)” access time to data – for equality or range queries!
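      • These figures are just powers of the assumed fanout of 133, as a quick Python check shows:

        fanout = 133
        for level in range(1, 5):
            print(f"level {level}: {fanout ** (level - 1):>9,} pages")
        print(f"height 3 holds {fanout ** 3:,} records; height 4 holds {fanout ** 4:,}")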
    • 22. Inserting Data into a B+ Tree
      • Find correct leaf L.
      • Put data entry onto L.
        • If L has enough space, done!
        • Else, must split L (into L and a new node L2)
          • Redistribute entries evenly, copy up middle key.
          • Insert index entry pointing to L2 into parent of L.
      • This can happen recursively
        • To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.)
      • Splits “grow” tree; root split increases height.
        • Tree growth: gets wider or one level taller at top.
    • 23. Inserting 8* Example: Copy Up
      [Figure: the example tree again, root keys 13, 17, 24, 30. We want to insert 8* into the leaf (2* 3* 5* 7*), but there is no room, so we split it and copy up: the leaf becomes (2* 3*) and (5* 7* 8*), and the entry 5 is to be inserted into the parent node. Note that 5 is copied up and continues to appear in the leaf.]
    • 24. Inserting 8* Example: Push Up
      [Figure: the root (13 17 24 30) must now accommodate the entry 5; it is full, so we split the index node and push up its middle key, yielding index nodes (5 13) and (24 30) beneath a new root holding 17. Note that 17 is pushed up and appears only once in the index; contrast this with a leaf split.]
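      • The two kinds of split from these examples, as a small Python sketch (list-of-keys nodes as in the earlier search sketch; everything else is illustrative):

        def split_leaf(keys):
            """Redistribute evenly; COPY the middle key up (it stays in the leaf)."""
            mid = len(keys) // 2
            left, right = keys[:mid], keys[mid:]
            return left, right, right[0]

        def split_index(keys):
            """Redistribute evenly; PUSH the middle key up (it leaves the node)."""
            mid = len(keys) // 2
            return keys[:mid], keys[mid + 1:], keys[mid]

        # Inserting 8* into the full leaf [2, 3, 5, 7]:
        print(split_leaf(sorted([2, 3, 5, 7] + [8])))    # ([2, 3], [5, 7, 8], 5)
        # The root [5, 13, 17, 24, 30] is now overfull; split it:
        print(split_index([5, 13, 17, 24, 30]))          # ([5, 13], [24, 30], 17)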
    • 25. Deleting Data from a B+ Tree
      • Start at root, find leaf L where entry belongs.
      • Remove the entry.
        • If L is at least half-full, done!
        • If L has only d-1 entries,
          • Try to re-distribute, borrowing from sibling (adjacent node with same parent as L).
          • If re-distribution fails, merge L and sibling.
      • If merge occurred, must delete entry (pointing to L or sibling) from parent of L.
      • Merge could propagate to root, decreasing height.
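      • A sketch of that leaf-level decision in Python (same list-of-keys leaves; d is the order, and the parent's separator-key maintenance is only noted in comments):

        d = 2   # order: every leaf must keep at least d entries

        def delete(leaf, right_sibling, key):
            leaf.remove(key)
            if len(leaf) >= d:
                return "done"
            if len(right_sibling) > d:                # sibling can spare an entry
                leaf.append(right_sibling.pop(0))     # borrow; the parent's separator
                return "redistributed"                # must be updated afterwards
            right_sibling[:0] = leaf                  # merge into one leaf; the parent
            leaf.clear()                              # must drop its entry for `leaf`
            return "merged"

        print(delete([19, 20], [22, 24, 27], 20))   # redistributed
        print(delete([19, 20], [22, 24], 20))       # merged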
    • 26. B+ Tree Summary
      • B+ tree and other indices ideal for range searches, good for equality searches.
        • Inserts/deletes leave tree height-balanced; log_F N cost.
        • High fanout (F) means depth rarely more than 3 or 4.
        • Almost always better than maintaining a sorted file.
        • Typically, 67% occupancy on average.
        • Note: Order (d) concept replaced by physical space criterion in practice (“at least half-full”).
          • Records may be variable sized
          • Index pages typically hold more entries than leaves
