# 11/11

### Transcript

• 1. Data Integration, Concluded; Physical Data Storage. Zachary G. Ives, University of Pennsylvania. CIS 550 – Database & Information Systems. June 21, 2010
• 2. An Alternate Approach: The Information Manifold (Levy et al.)
• When you integrate something, you have some conceptual model of the integrated domain
• Define that as a basic frame of reference, everything else as a view over it
• “Local as View”
• May have overlapping/incomplete sources
• Define each source as the subset of a query over the mediated schema
• We can use selection or join predicates to specify that a source contains a range of values:
• ComputerBooks(…) ⊆ Books(Title, …, Subj), Subj = “Computers”
• 3. The Local-as-View Model
• The basic model is the following:
• “Local” sources are views over the mediated schema
• Sources have the data – mediated schema is virtual
• Sources may not have all the data from the domain – “open-world assumption”
• The system must use the sources (views) to answer queries over the mediated schema
• Assumption: conjunctive queries, set semantics
• Suppose we have a mediated schema: author(aID, isbn, year), book(isbn, title, publisher)
• Suppose we have the query:
• q(a, t) :- author(a, i, _), book(i, t, p), t = “DB2 UDB”
• and sources:
• s1(a, t) ⊆ author(a, i, _), book(i, t, p), t = “DB2 UDB”
• s5(a, t, p) ⊆ author(a, i, _), book(i, t, p), p = “SAMS”
• We want to compose the query with the source mappings – but they’re in the wrong direction!
• Yet: everything in s1, s5 is an answer to the query!
• 5. Answering Queries Using Views
• Numerous recently developed algorithms exist for this problem:
• Inverse rules [Duschka et al.]
• Bucket algorithm [Levy et al.]
• MiniCon [Pottinger & Halevy]
• Also related: “chase and backchase” [Popa, Tannen, Deutsch]
• Requires conjunctive queries
• 6. Inverse Rules
• Reverse the implication in each rule:
• author(a, i₁(a,t), _) :- s1(a,t)
• book(i₁(a,t), t, p₁(a,t)) :- s1(a,t), t = “DB2 UDB”
• author(a, i₅(a,t,p), _) :- s5(a, t, p)
• book(i₅(a,t,p), t, p) :- s5(a, t, p), p = “SAMS”
• Then unfold:
• q(a, t) :- author(a, i, _), book(i, t, p), t = “DB2 UDB”
• q(a, t) :- s1(a, x), i = i₁(a,x), s1(a,t), t = “DB2 UDB”, i = i₁(a,t), p = p₁(a,t), t = “DB2 UDB”
• q(a, t) :- author(a, i, _), book(i, t, p), t = “DB2 UDB”
• q(a, t) :- s5(a, x, y), i = i₅(a,x,y), s5(a,t,p), i = i₅(a,t,p), p = “SAMS”, t = “DB2 UDB”
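
The inverse-rules idea can be tried out on toy data. The following is a minimal sketch, not the algorithm as published: the source tuples are invented for illustration, Skolem terms are modeled as tagged tuples, and the year Skolems (`y1`, `y5`) are an assumption standing in for the `_` placeholder in the rules.

```python
# Toy source instances (hypothetical data, for illustration only).
s1 = {("Alice", "DB2 UDB")}
s5 = {("Bob", "DB2 UDB", "SAMS"), ("Carol", "XML Basics", "SAMS")}


def apply_inverse_rules(s1, s5):
    """Populate the virtual relations author and book from the sources.

    Values the sources do not provide (ISBN, year, publisher) become
    Skolem terms: tuples tagged with a function name and its arguments.
    """
    author, book = set(), set()
    for a, t in s1:
        isbn = ("i1", a, t)
        author.add((a, isbn, ("y1", a, t)))
        book.add((isbn, t, ("p1", a, t)))
    for a, t, p in s5:
        isbn = ("i5", a, t, p)
        author.add((a, isbn, ("y5", a, t, p)))
        book.add((isbn, t, p))
    return author, book


def q(author, book):
    """q(a, t) :- author(a, i, _), book(i, t, p), t = "DB2 UDB"."""
    return {
        (a, t)
        for (a, i, _year) in author
        for (j, t, _pub) in book
        if i == j and t == "DB2 UDB"
    }


author, book = apply_inverse_rules(s1, s5)
answers = q(author, book)
print(sorted(answers))
```

Because two Skolem terms are equal only when they come from the same source tuple, the join on ISBN pairs each author fact with the book fact derived from the same tuple, which is the effect of the unfolding shown above.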
• 7. Summary of Data Integration
• Local-as-view integration has replaced global-as-view as the standard
• More robust way of defining mediated schemas and sources
• Mediated schema is clearly defined, less likely to change
• Sources can be more accurately described
• Methods exist for query reformulation, including inverse rules
• Integration requires standardization on a single schema
• Can be hard to get consensus
• Today we have peer-to-peer data integration, e.g., Piazza [Halevy et al.], Orchestra [Ives et al.], Hyperion [Miller et al.]
• Some other aspects of integration were addressed in related papers
• Overlap between sources; coverage of data at sources
• Semi-automated creation of mappings and wrappers
• Data integration capabilities in commercial products: Oracle Fusion, IBM’s WebSphere Information Integrator, numerous packages from middleware companies; MS BizTalk Mapper, IBM Rational Data Architect
• 8. Performance: What Governs It?
• Speed of the machine – of course!
• But also many software-controlled factors that we must understand:
• Caching and buffer management
• How the data is stored – physical layout, partitioning
• Auxiliary structures – indices
• Different algorithms for operations – query execution
• Different orderings for execution – query optimization
• Reuse of materialized views, merging of query subexpressions – answering queries using views; multi-query optimization
• 9. Our General Emphasis
• Goal: cover basic principles that are applied throughout database system design
• Use the appropriate strategy in the appropriate place
• Every (reasonable) algorithm is good somewhere
• … And a corollary: database people reinvent a lot of things and add minor tweaks…
• 10. Storing Tuples in Pages
• Tuples
• Many possible layouts
• Dynamic vs. fixed lengths
• Ptrs, lengths vs. slots
• Tuples grow down, directories grow up
• Identity and relocation
• Objects and XML are harder
• Horizontal, path, vertical partitioning
• Generally no algorithmic way of deciding
• Generally want to leave some space for insertions
[Figure: page layout holding tuples t1, t2, t3]
• 11. Alternatives for Organizing Files
• Many alternatives, each ideal for some situation, and poor for others:
• Heap files: for full file scans or frequent updates
• Data unordered
• Write new data at end
• Sorted files: if data is retrieved in sort order or a range of values is wanted
• Need external sort or an index to keep sorted
• Hashed Files: if selection on equality
• Collection of buckets with primary & overflow pages
• Hashing function over search key attributes
• 12. Model for Analyzing Access Costs
• We ignore CPU costs, for simplicity:
• p(T): The number of data pages in table T
• r(T): Number of records in table T
• D: (Average) time to read or write disk page
• Measuring number of page I/O’s ignores gains of pre-fetching blocks of pages; thus, I/O cost is only approximated.
• Average-case analysis; based on several simplistic assumptions.
• Good enough to show the overall trends!
• 13. Approximate Cost of Operations
• Several assumptions underlie these (rough) estimates!

| | Heap File | Sorted File | Hashed File* |
|---|---|---|---|
| Scan all recs | p(T) D | p(T) D | 1.25 p(T) D |
| Equality Search | p(T) D / 2 | D log₂ p(T) | D |
| Range Search | p(T) D | D log₂ p(T) + (# pages with matches) | 1.25 p(T) D |
| Insert | 2D | Search + p(T) D | 2D |
| Delete | Search + D | Search + p(T) D | 2D |

\* No overflow buckets, 80% page occupancy

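
To get a feel for the trends, the estimates can be evaluated for concrete values. This is a rough sketch under stated assumptions: the function and parameter names are hypothetical, "Search" is taken to be the equality-search cost of the same file organization, and the matching-pages term of the sorted-file range search is multiplied by D so that every entry is a time.

```python
import math


def access_costs(p, D, match_pages=0):
    """Approximate I/O time per operation for each file organization.

    p: number of data pages p(T)
    D: average time to read or write one page
    match_pages: pages holding matches for a range search (assumed name)
    """
    heap_search = p * D / 2            # on average, scan half the file
    sorted_search = D * math.log2(p)   # binary search over the pages
    return {
        "Scan all recs": {"heap": p * D, "sorted": p * D, "hashed": 1.25 * p * D},
        "Equality Search": {"heap": heap_search, "sorted": sorted_search, "hashed": D},
        "Range Search": {
            "heap": p * D,
            "sorted": sorted_search + match_pages * D,
            "hashed": 1.25 * p * D,
        },
        "Insert": {"heap": 2 * D, "sorted": sorted_search + p * D, "hashed": 2 * D},
        "Delete": {"heap": heap_search + D, "sorted": sorted_search + p * D, "hashed": 2 * D},
    }


# Example: a 1024-page table with D = 1 time unit per page I/O.
costs = access_costs(p=1024, D=1.0, match_pages=3)
for operation, by_file in costs.items():
    print(operation, by_file)
```

With these numbers an equality search costs 512 units on the heap file, 10 on the sorted file, and 1 on the hashed file, while inserts into the sorted file are by far the most expensive operation.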
• 14. Speeding Operations over Data
• Recall that we’re interested in how to get good performance in answering queries
• The first consideration is how the data is made accessible to the DBMS
• We saw different arrangements of the tables: Heap (unsorted) files, sorted files, and hashed files
• Today we look further at 3 core concepts that are used to efficiently support sort- and hash-based access to data:
• Indexing
• Sorting
• Hashing
• 15. Technique I: Indexing
• An index on a file speeds up selections on the search key attributes for the index (trade space for speed).
• Any subset of the fields of a relation can be the search key for an index on the relation.
• Search key is not the same as key (minimal set of fields that uniquely identify a record in a relation).
• An index contains a collection of data entries, and supports efficient retrieval of all data entries k* with a given key value k.
• Generally the entries of an index are some form of node in a tree – but should the index contain the data, or pointers to the data?
• 16. Alternatives for Data Entry k* in Index
• Three alternatives for where to put the data:
• (1) Data record wherever key value k appears
• Clustered → fast lookup
• Index is large; only 1 can exist
• (2) <k, rid of data record with search key value k>, OR
• (3) <k, list of rids of data records with search key k>
• Can have secondary indices
• Smaller index may mean faster lookup
• Often not clustered → more expensive to use
• Choice of alternative for data entries is orthogonal to the indexing technique used to locate data entries with a given key value k
rid = row id, conceptually a pointer
• 17. Classes of Indices
• Primary vs. secondary: a primary index is built on the primary key
• Most DBMSs automatically generate a primary index when you define a primary key
• Clustered vs. unclustered: in a clustered index, the order of the data records is approximately the same as the order of the index entries
• Alternative 1 implies clustered, but not vice-versa
• A file can be clustered on at most one search key
• Dense vs. sparse: a dense index has an index entry per data value; a sparse index may “skip” some
• Alternative 1 always leads to dense index [Why?]
• Every sparse index is clustered!
• Sparse indexes are smaller; however, some useful optimizations are based on dense indexes
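
The dense/sparse distinction can be illustrated with a small sketch. It assumes a hypothetical sorted data file held as a list of pages; the sparse index keeps only the first key of each page, which works only because the file is sorted (clustered) on that key.

```python
from bisect import bisect_right

# Hypothetical data file, sorted on the search key, four pages.
pages = [[2, 3, 5, 7], [14, 16], [19, 20, 22], [24, 27, 29]]

# Sparse index: one entry per page (the page's smallest key).
sparse_index = [page[0] for page in pages]

# Dense index: one entry per key, mapping to (page number, slot).
dense_index = {
    key: (page_no, slot)
    for page_no, page in enumerate(pages)
    for slot, key in enumerate(page)
}


def sparse_lookup(key):
    """Find (page number, slot) for key via the sparse index, or None."""
    pos = bisect_right(sparse_index, key) - 1  # last page whose first key <= key
    if pos < 0:
        return None
    page = pages[pos]
    return (pos, page.index(key)) if key in page else None


print(len(sparse_index), len(dense_index))
print(sparse_lookup(20), dense_index.get(20))
print(sparse_lookup(15), dense_index.get(15))
```

The sparse index has to read the candidate page to decide whether a key exists, whereas the dense index can answer an existence check without touching the data file, which is one example of an optimization that needs a dense index.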
• 18. Clustered vs. Unclustered Index
• Suppose Index Alternative (2) used, with pointers to records stored in a heap file
• Perhaps initially sort data file, leave some gaps
• Inserts may require overflow pages
• Consider how these strategies affect disk caching and access
[Figure: CLUSTERED vs. UNCLUSTERED index. In both, index entries direct the search for data entries (index file), and data entries point to data records (data file).]
• 19. B+ Tree: The DB World’s Favorite Index
• Insert/delete at log_F N cost
• (F = fanout, N = # leaf pages)
• Keep tree height-balanced
• Minimum 50% occupancy (except for root).
• Each node contains d <= m <= 2d entries; d is called the order of the tree.
• Supports equality and range searches efficiently.
[Figure: index entries (direct search) above data entries (the "sequence set")]
• 20. Example B+ Tree
• Search begins at root, and key comparisons direct it to a leaf.
• Search for 5*, 15*, all data entries >= 24* ...
• Based on the search for 15*, we know it is not in the tree!
• Root: 13 17 24 30
• Leaves: [2* 3* 5* 7*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]
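
The searches on this example tree can be traced with a short sketch. It hard-codes the two-level tree from the slide (root keys 13, 17, 24, 30 over five leaves); the function names are hypothetical, and a real B+ tree would follow page pointers rather than list indices.

```python
from bisect import bisect_right

root_keys = [13, 17, 24, 30]
leaves = [
    [2, 3, 5, 7],
    [14, 16],
    [19, 20, 22],
    [24, 27, 29],
    [33, 34, 38, 39],
]


def find_leaf(key):
    """Key comparisons at the root pick the child: keys >= separator go right."""
    return bisect_right(root_keys, key)


def contains(key):
    """Equality search: descend to one leaf and look there only."""
    return key in leaves[find_leaf(key)]


def range_from(low):
    """Range search: find the starting leaf, then scan leaves left to right."""
    start = find_leaf(low)
    return [k for leaf in leaves[start:] for k in leaf if k >= low]


print(contains(5), contains(15))
print(range_from(24))
```

The search for 15 lands in the leaf [14, 16] and finds nothing there, which is enough to conclude that 15 is not in the tree.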
• 21. B+ Trees in Practice
• Typical order: 100. Typical fill-factor: 67%.
• average fanout = 133
• Typical capacities:
• Height 4: 133^4 = 312,900,721 records
• Height 3: 133^3 = 2,352,637 records
• Can often hold top levels of tree in buffer pool:
• Level 1 = 1 page = 8 KB
• Level 2 = 133 pages = 1 MB
• Level 3 = 17,689 pages = 133 MB
• Level 4 = 2,352,637 pages = 18 GB
• “Nearly O(1)” access time to data – for equality or range queries!
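
The capacity figures follow from simple arithmetic, reproduced here as a sketch. The fanout of 133 and the 8 KB page size are taken from the slide; the function names are hypothetical. The exact sizes come out slightly different from the rounded values on the slide.

```python
FANOUT = 133   # average fanout from the slide
PAGE_KB = 8    # page size from the slide


def records_at_height(height):
    """Number of entries reachable through a tree of the given height."""
    return FANOUT ** height


def pages_at_level(level):
    """Number of pages on a level, with the root as level 1."""
    return FANOUT ** (level - 1)


for height in (3, 4):
    print(f"height {height}: {records_at_height(height):,} records")

for level in range(1, 5):
    pages = pages_at_level(level)
    size_mb = pages * PAGE_KB / 1024
    print(f"level {level}: {pages:,} pages, about {size_mb:,.1f} MB")
```

The top three levels together need roughly 139 MB, which is why they can often stay in the buffer pool and leave only one or two disk reads per lookup.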
• 22. Inserting Data into a B+ Tree
• Find correct leaf L.
• Put data entry onto L.
• If L has enough space, done!
• Else, must split L (into L and a new node L2)
• Redistribute entries evenly, copy up middle key.
• Insert index entry pointing to L2 into parent of L.
• This can happen recursively
• To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.)
• Splits “grow” tree; root split increases height.
• Tree growth: gets wider or one level taller at top.
• 23. Inserting 8* Example: Copy up
• Root: 13 17 24 30
• Leaves: [2* 3* 5* 7*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]
• Want to insert 8* into leaf [2* 3* 5* 7*]; no room, so split & copy up: [2* 3*] and [5* 7* 8*]
• 5 is the entry to be inserted in the parent node. (Note that 5 is copied up and continues to appear in the leaf.)
• 24. Inserting 8* Example: Push up
• Root (full): 13 17 24 30, with 5 to be inserted
• Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]
• Need to split node & push up: [5 13] and [24 30]
• 17 is the entry to be inserted in the parent node. (Note that 17 is pushed up and only appears once in the index. Contrast this with a leaf split.)
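
The difference between copying up and pushing up comes down to whether the middle key stays in one of the halves. A minimal sketch, with hypothetical function names and with child pointers omitted for brevity (only keys are tracked):

```python
def split_leaf(entries):
    """Split an overfull leaf; the middle key is COPIED up.

    Returns (left, right, key_for_parent). The key stays in the right
    leaf because leaves must keep every data entry.
    """
    mid = len(entries) // 2
    left, right = entries[:mid], entries[mid:]
    return left, right, right[0]


def split_index(keys):
    """Split an overfull index node; the middle key is PUSHED up.

    Returns (left, right, key_for_parent). The key moves to the parent
    and appears in neither half.
    """
    mid = len(keys) // 2
    return keys[:mid], keys[mid + 1:], keys[mid]


# The leaf from the example after adding 8*.
print(split_leaf([2, 3, 5, 7, 8]))

# The root from the example after receiving 5 from the leaf split.
print(split_index([5, 13, 17, 24, 30]))
```

Running it on the example values gives the leaf halves [2, 3] and [5, 7, 8] with 5 copied up, and the index halves [5, 13] and [24, 30] with 17 pushed up to become the new root.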
• 25. Deleting Data from a B+ Tree
• Start at root, find leaf L where entry belongs.
• Remove the entry.
• If L is at least half-full, done!
• If L has only d-1 entries,
• Try to re-distribute, borrowing from sibling (adjacent node with same parent as L).
• If re-distribution fails, merge L and sibling.
• If merge occurred, must delete entry (pointing to L or sibling) from parent of L.
• Merge could propagate to root, decreasing height.
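
The choice between redistribution and merging can be sketched for a single leaf and its right sibling. This is a simplified illustration under stated assumptions: the function name and the example values are hypothetical, only the right sibling is considered, and updating the parent (replacing the separator or deleting an entry) is left to the caller.

```python
def rebalance_leaf(leaf, right_sibling, d):
    """Fix an underfull leaf (d - 1 entries) using its right sibling.

    Returns ("redistributed", new_separator) or ("merged", None).
    Both lists are modified in place.
    """
    if len(right_sibling) > d:
        # Sibling can spare an entry: borrow its smallest one.
        leaf.append(right_sibling.pop(0))
        return "redistributed", right_sibling[0]
    # Sibling is at minimum occupancy: merge everything into `leaf`.
    leaf.extend(right_sibling)
    right_sibling.clear()
    return "merged", None


# Order d = 2: a leaf with one entry is underfull.
leaf, sibling = [22], [24, 27, 29]
print(rebalance_leaf(leaf, sibling, d=2), leaf, sibling)

leaf, sibling = [22], [27, 29]
print(rebalance_leaf(leaf, sibling, d=2), leaf, sibling)
```

In the first call the sibling has a spare entry, so 24 moves over and 27 becomes the new separator for the parent. In the second call the sibling is at minimum occupancy, so the two leaves merge and the parent loses an entry, which is the step that can propagate upward.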
• 26. B+ Tree Summary
• B+ trees and other tree-structured indices are ideal for range searches and good for equality searches.
• Inserts/deletes leave tree height-balanced; log_F N cost.
• High fanout (F) means depth rarely more than 3 or 4.
• Almost always better than maintaining a sorted file.
• Typically, 67% occupancy on average.
• Note: Order (d) concept replaced by physical space criterion in practice (“at least half-full”).
• Records may be variable sized
• Index pages typically hold more entries than leaves