
  1. 1. Query Tree Question <ul><li>Should we do a π pname, pnumber, then σ pname = ‘Aquarius’, then π pnumber? </li></ul><ul><li>No, since the operations are done together </li></ul><ul><ul><li>the processor would read a row of project, see if pname = ‘Aquarius’, then use pnumber to perform the join. </li></ul></ul><ul><li>Our query tree has only 2 groups, not 3 </li></ul>
  2. 2. <ul><li>Select Operation Strategies </li></ul><ul><li>And Indexing </li></ul><ul><li>(Chapter 8) </li></ul><ul><li>*Some info on slides from Dr. S. Son, U. Va </li></ul>
  3. 3. Disk access <ul><li>DBs traditionally stored on disk </li></ul><ul><li>Cheaper to store on disk than in memory </li></ul><ul><li>Costs for: </li></ul><ul><ul><li>Seek time, latency, data transfer time </li></ul></ul><ul><li>  Disk access is page oriented </li></ul><ul><li>2 - 4 KB page size </li></ul>
  4. 4. Access time <ul><li>Time to randomly access a page : </li></ul><ul><ul><li>12-20 ms which is 50-83 I/O's per second </li></ul></ul><ul><li>Large disparity between disk access and memory access (10-200 ns) </li></ul><ul><li>System initially determines if page in memory buffer (page tables, etc.) </li></ul>
  5. 5. Table scan <ul><li>Linear search - all data rows read in </li></ul><ul><ul><li>I/O parallelism can be used </li></ul></ul><ul><ul><ul><li>multiple I/O read requests satisfied at the same time </li></ul></ul></ul><ul><ul><ul><li>stripe the data across different disks        </li></ul></ul></ul><ul><ul><li>Problems with parallelism? </li></ul></ul><ul><ul><ul><li>must balance disk arm load to gain maximum parallelism </li></ul></ul></ul><ul><ul><ul><li>requires the same total number of random I/O's, but using devices for a shorter time </li></ul></ul></ul>
  6. 6. Sequential prefetch I/O <ul><li>Retrieve one disk page after another (on same track) - typically 32 </li></ul><ul><li>Seek time no longer a problem </li></ul><ul><li>Must know in advance to read 32 successive pages </li></ul><ul><li>Speed up of I/O by a factor of ≈ 7 (500 I/O's per second vs. 70) </li></ul>
  7. 7. Access time <ul><li>Seek time – 10-15 ms </li></ul><ul><li>Latency time – 2-5 ms </li></ul><ul><li>Data transfer time – 0.5-1.5 ms per page </li></ul>
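  8. 8. Access time for fast I/O <ul><li>Random I/O (RIO) vs. sequential prefetch, times in seconds </li></ul><ul><ul><li>Seek (disk arm to cylinder): RIO .010, seq. prefetch .010 </li></ul></ul><ul><ul><li>Latency (platter to sector): RIO .002, seq. prefetch .002 </li></ul></ul><ul><ul><li>Data transfer: RIO .0015 (1 page), seq. prefetch .048 (32 pages) </li></ul></ul><ul><ul><li>Total: RIO .0135 (1 page), seq. prefetch .060 (32 pages) </li></ul></ul><ul><li>For 32 pages: .43 seconds with RIO vs. .060 seconds with seq. prefetch </li></ul>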
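  9. 9. Textbook access time <ul><li>Random I/O (RIO) vs. sequential prefetch, times in seconds </li></ul><ul><ul><li>Seek (disk arm to cylinder): RIO .008, seq. prefetch .008 </li></ul></ul><ul><ul><li>Latency (platter to sector): RIO .004, seq. prefetch .004 </li></ul></ul><ul><ul><li>Data transfer: RIO .0005 (1 page), seq. prefetch .016 (32 pages) </li></ul></ul><ul><ul><li>Total: RIO .0125 (1 page), seq. prefetch .028 (32 pages) </li></ul></ul><ul><li>For 32 pages: .40 seconds with RIO vs. .028 seconds with seq. prefetch </li></ul>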
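The totals on the two access-time slides can be re-derived from their components. The sketch below is only a check of that arithmetic; the dictionary keys and function names are made up for illustration, and the model (one seek and one latency per prefetch request) is the one the slides imply.

```python
# Per-I/O component costs in seconds, copied from the two access-time slides.
COSTS = {
    "lecture": {"seek": 0.010, "latency": 0.002, "transfer_per_page": 0.0015},
    "textbook": {"seek": 0.008, "latency": 0.004, "transfer_per_page": 0.0005},
}
PREFETCH_PAGES = 32


def random_io_time(cost, pages):
    """Every page pays its own seek, rotational latency, and transfer."""
    return pages * (cost["seek"] + cost["latency"] + cost["transfer_per_page"])


def prefetch_time(cost, pages):
    """One seek and one latency, then `pages` back-to-back page transfers."""
    return cost["seek"] + cost["latency"] + pages * cost["transfer_per_page"]


for name, cost in COSTS.items():
    rio = random_io_time(cost, PREFETCH_PAGES)
    seq = prefetch_time(cost, PREFETCH_PAGES)
    print(f"{name}: random {rio:.3f} s, prefetch {seq:.3f} s, ratio {rio / seq:.1f}x")
```

With these inputs the prefetch advantage comes out at roughly 7x for the lecture figures and roughly 14x for the textbook figures.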
  10. 10. Disk allocation <ul><li>Disk Resource Allocation for Databases (DBA has control) </li></ul><ul><li>Goal – contiguous sectors on disk - want data as close together as possible  to minimize seek time </li></ul><ul><li>No standard SQL approach, but general way to deal with allocation </li></ul><ul><li>Some OS allow specification of size of file and disk device </li></ul>
  11. 11. Tablespace <ul><li>Allocation medium for tables and indexes for ORACLE, DB2, etc. </li></ul><ul><li>Usually relations (files) cannot span disk devices </li></ul><ul><li>Can put >1 table in a table space if accessed together </li></ul><ul><li>Tablespace corresponds to 1 or more OS files and can span disk devices </li></ul>
  12. 12. Query Language <ul><li>ORACLE DB's contain several tablespaces, including one called system -      data description +  indexes + user-defined tables </li></ul><ul><li>Create tablespace tspace1 datafile 'fname1', 'fname2'; </li></ul><ul><li>default tablespace given to each user </li></ul><ul><li>if multiple tablespaces - better control over load balancing </li></ul><ul><li>can take some disk space off-line </li></ul>
  13. 13. Extent <ul><li>extent - contiguous storage on disk </li></ul><ul><li>when data segment or index segment first created, given an initial extent from tablespace of 10KB (5 pages) </li></ul><ul><li>if need more space given next contiguous extent </li></ul><ul><li>can increase the size by a positive % (cannot decrease) </li></ul><ul><ul><li>initial n - size of initial extent </li></ul></ul><ul><ul><li>next n - size of next extent </li></ul></ul><ul><ul><li>max extents - maximum number of extents </li></ul></ul><ul><ul><li>min extents - number of extents initially allocated </li></ul></ul><ul><ul><li>pct increase n - % by which next extent grows over previous one </li></ul></ul>
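A minimal sketch of how the storage parameters on this slide determine successive extent sizes. It assumes the percentage increase is applied from the third extent onward and ignores any rounding to whole pages; the function name and the sample values are hypothetical.

```python
def extent_sizes(initial_kb, next_kb, pct_increase, count):
    """Sizes (in KB) of the first `count` extents of a segment.

    Assumption: the first extent has size `initial`, the second has size
    `next`, and each later extent is pct_increase % larger than the one
    before it. Rounding to whole pages is ignored.
    """
    sizes = [initial_kb]
    size = next_kb
    while len(sizes) < count:
        sizes.append(size)
        size = size * (100 + pct_increase) / 100
    return sizes


# Hypothetical settings: initial 10 KB, next 10 KB, pct increase 50.
print(extent_sizes(10, 10, 50, 4))
```

Because the percentage must be positive or zero, the sequence never shrinks, which matches the slide's "cannot decrease".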
  14. 14. Create table <ul><li>Create table statement - can specify tablespace, no. of extents </li></ul><ul><ul><li>When initial extent full, new extent allocated </li></ul></ul><ul><li>pctfree - determines how much space can be used for inserts of new rows </li></ul><ul><ul><li>if pctfree = 10%, inserts stop when page is 90% full </li></ul></ul><ul><li>pctused – determines when new inserts start again </li></ul><ul><ul><li>inserts resume when used space falls below pctused percent of the page; default pctused = 40% </li></ul></ul><ul><ul><li>pctfree + pctused < 100 </li></ul></ul>
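The interplay of pctfree and pctused is a hysteresis: a page stops taking inserts at one threshold and starts again only after dropping below a lower one. The sketch below models that behaviour for a single page; the `Block` class and its fields are hypothetical, and space is tracked in whole percent for simplicity.

```python
class Block:
    """Hypothetical model of one data page governed by pctfree / pctused."""

    def __init__(self, pctfree=10, pctused=40):
        if pctfree + pctused >= 100:
            raise ValueError("pctfree + pctused must be < 100")
        self.pctfree = pctfree
        self.pctused = pctused
        self.used_pct = 0            # percent of the page currently occupied
        self.accepts_inserts = True

    def insert(self, row_pct):
        """Try to add a row that occupies row_pct percent of the page."""
        limit = 100 - self.pctfree
        if not self.accepts_inserts or self.used_pct + row_pct > limit:
            return False
        self.used_pct += row_pct
        if self.used_pct >= limit:
            self.accepts_inserts = False   # page is "full" for inserts
        return True

    def delete(self, row_pct):
        """Remove a row; inserts resume only once usage drops below pctused."""
        self.used_pct = max(0, self.used_pct - row_pct)
        if self.used_pct < self.pctused:
            self.accepts_inserts = True


block = Block(pctfree=10, pctused=40)
for _ in range(9):
    block.insert(10)           # fills the page to 90%
print(block.insert(10))        # rejected: the pctfree reserve is reached
block.delete(30)               # 60% used, still above pctused
print(block.accepts_inserts)
block.delete(30)               # 30% used, below pctused
print(block.accepts_inserts)
```

Keeping the two thresholds apart is what prevents a page from flipping between "open" and "closed" on every small delete.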
  15. 15. Rows <ul><li>Row layout on each disk page (see figure) </li></ul><ul><li>Row directory – row number and page byte offset </li></ul><ul><ul><li>Row number is row number in page – book calls it slot# </li></ul></ul><ul><ul><li>Page byte offset – with varchar, row size not constant </li></ul></ul><ul><li>To identify a particular row use RID (RowID) – </li></ul><ul><li>page #, slot # [file#] </li></ul><ul><li>slot# is number in row directory (logical #) </li></ul>
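A minimal in-memory sketch of the row directory described on this slide: each slot records a byte offset and a length, so variable-length rows can be located by (page #, slot #). The `Page` and `RID` names are illustrative, and a real page would store the directory inside the page bytes rather than in a Python list.

```python
from typing import NamedTuple


class RID(NamedTuple):
    page_no: int
    slot_no: int


class Page:
    """Simplified slotted page with a row directory (slot# -> offset, length)."""

    def __init__(self, page_no, size=2048):
        self.page_no = page_no
        self.data = bytearray(size)
        self.directory = []      # index in this list is the slot number
        self.free_start = 0      # first unused byte in the page

    def insert(self, row: bytes) -> RID:
        if self.free_start + len(row) > len(self.data):
            raise ValueError("row does not fit on this page")
        offset = self.free_start
        self.data[offset:offset + len(row)] = row
        self.directory.append((offset, len(row)))
        self.free_start += len(row)
        return RID(self.page_no, len(self.directory) - 1)

    def fetch(self, rid: RID) -> bytes:
        offset, length = self.directory[rid.slot_no]
        return bytes(self.data[offset:offset + length])


page = Page(page_no=7)
first = page.insert(b"Smith")
second = page.insert(b"Nakamura, K.")   # a different length: varchar-style rows
print(second, page.fetch(second))
```

Since the slot number is logical, a row can be moved within the page by updating only its directory entry; the RID held by indexes stays valid.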
  16. 16. Differences in DBMSs <ul><li>RID can be retrieved in ORACLE but not DB2 (violates relational model rule) </li></ul><ul><li>ORACLE </li></ul><ul><ul><ul><li>rows can be split between pages (row record fragmentation) </li></ul></ul></ul><ul><ul><ul><li>Can have rows from multiple tables on same page, more info </li></ul></ul></ul><ul><li>DB2, no splitting, entire row moved to new page, need forwarding pointer </li></ul>
  17. 17. Binary Search <ul><li>“Find all students with gpa > 3.0” </li></ul><ul><ul><li>If data is in sorted file, do binary search to find first such student, then scan to find others. </li></ul></ul><ul><ul><li>Cost of binary search can be quite high. </li></ul></ul><ul><li>Simple idea: Create an ‘index’ file. </li></ul>[Figure: index file with entries k1, k2, ..., kN pointing into the data file pages Page 1, Page 2, Page 3, ..., Page N]
  18. 18. Binary Search <ul><li>Binary search on disk </li></ul><ul><ul><li>optimal for comparisons - not optimal for disk-based look-up </li></ul></ul><ul><ul><li>must keep data in order </li></ul></ul><ul><ul><li>may be reading values from same page at different times </li></ul></ul><ul><li> Instead use B+-tree index </li></ul>
  19. 19. Indexing <ul><li>Keyed access retrieval method </li></ul><ul><li>index is a sorted file - sorted by index key </li></ul><ul><li>index entries: </li></ul><ul><ul><ul><li>index key, pointer (RID) </li></ul></ul></ul><ul><li>pointer is RID </li></ul><ul><li>index resides on disk, partially memory resident when accessed </li></ul>
  20. 20. Indexing <ul><li>As for any index, 3 alternatives for data entries k* : </li></ul><ul><ul><li>Data record with key value k </li></ul></ul><ul><ul><li>< k , rid of data record with search key value k > </li></ul></ul><ul><ul><li>< k , list of rids of data records with search key k > </li></ul></ul><ul><li>Choice is orthogonal to the indexing technique used to locate data entries k* . </li></ul><ul><li>Tree-structured indexing techniques support both range searches and equality searches . </li></ul><ul><li>B+ tree : dynamic, adjusts gracefully under inserts and deletes. </li></ul>
  21. 21. B+-tree <ul><li>Most commonly used index structure type in DBs today </li></ul><ul><li>Based on B-tree </li></ul><ul><li>Used to minimize disk I/O </li></ul><ul><li>available in DB2; ORACLE also has hash cluster; Ingres has heap structure, B-tree, isam (chain together new nodes) </li></ul>
  22. 22. Structure of B+ Trees <ul><li>leaf level pointers to data (RIDs) </li></ul><ul><li>the remaining are directory (index) nodes that point to other index nodes </li></ul>[Figure: index entries (direct search) above data entries ("sequence set")]
  23. 23. Characteristics of B+ Tree <ul><li>Insert/delete at log<sub>F</sub> N cost; keep tree height-balanced. (F = fanout, N = # leaf pages) </li></ul><ul><li>Minimum 50% occupancy (except for root). Each node contains d <= m <= 2d entries. The parameter d is called the order of the tree. </li></ul><ul><li>Supports equality and range-searches efficiently </li></ul>
  24. 24. Cost of I/O for B+-tree <ul><li>Assume number of entries in each index node fits on one page - one node is one page </li></ul><ul><li>If tree with depth of 3, 3 I/Os to get pointer to data </li></ul><ul><li>B+-tree structured to get most out of every disk page read </li></ul><ul><li>Read in index node, can make multiple probes to same page if remains in memory </li></ul><ul><ul><li>likely since frequent access to upper-level nodes of actively used B+-trees </li></ul></ul>
  25. 25. B+ Trees in Practice <ul><li>Typical order: 100. Typical fill-factor: 67%. </li></ul><ul><ul><li>average fanout = 133 </li></ul></ul><ul><li>Typical capacities: </li></ul><ul><ul><li>Height 4: 133<sup>4</sup> = 312,900,721 records </li></ul></ul><ul><ul><li>Height 3: 133<sup>3</sup> = 2,352,637 records </li></ul></ul><ul><li>Can often hold top levels in buffer pool: </li></ul><ul><ul><li>Level 1 = 1 page = 8 Kbytes </li></ul></ul><ul><ul><li>Level 2 = 133 pages = 1 Mbyte </li></ul></ul><ul><ul><li>Level 3 = 17,689 pages = 133 MBytes </li></ul></ul>
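The capacities on this slide are powers of the average fanout. The short sketch below recomputes them, using the slide's fanout of 133 and 8 KB pages; the function names are illustrative only.

```python
ORDER = 100             # d: each node holds between d and 2d entries
FILL_FACTOR = 0.67
FANOUT = 133            # the slide's figure; 2 * ORDER * FILL_FACTOR is about 134
PAGE_BYTES = 8 * 1024


def records_reachable(height):
    """Number of data entries addressable by a tree of the given height."""
    return FANOUT ** height


def pages_at_level(level):
    """Number of index pages at a level, counting the root as level 1."""
    return FANOUT ** (level - 1)


for height in (3, 4):
    print(f"Height {height}: {records_reachable(height):,} records")

for level in (1, 2, 3):
    pages = pages_at_level(level)
    mib = pages * PAGE_BYTES / 2**20
    print(f"Level {level}: {pages:,} pages = {mib:.2f} MiB")
```

Level 3 works out to about 138 MiB when computed exactly; the slide's 133 MBytes comes from rounding 133 pages to 1 Mbyte at level 2.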
  26. 26. B+-tree <ul><li>Index has a directory structure that allows retrieval of a range of values efficiently </li></ul><ul><ul><li>search for leftmost index entry S<sub>i</sub> such that X <= S<sub>i</sub> </li></ul></ul><ul><li>Index entries always placed in sequence by value - can use sequential prefetch on index </li></ul><ul><li>Index entries shorter than data rows and require proportionately less I/O </li></ul>
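The "leftmost entry with X <= S_i" rule is what makes range retrieval cheap: locate the start once, then read entries in sequence. A minimal sketch over a flat sorted list standing in for the leaf level (the list layout and the `range_scan` name are assumptions for illustration):

```python
from bisect import bisect_left


def range_scan(entries, low, high):
    """Return the RIDs of all entries with low <= key <= high.

    `entries` is a list of (key, rid) pairs sorted by key, standing in for
    the leaf level of the index.
    """
    keys = [key for key, _ in entries]
    i = bisect_left(keys, low)       # leftmost position with keys[i] >= low
    rids = []
    while i < len(entries) and entries[i][0] <= high:
        rids.append(entries[i][1])   # sequential walk along the leaf level
        i += 1
    return rids


leaf_level = [(2, "r1"), (3, "r2"), (5, "r3"), (5, "r4"), (7, "r5"), (11, "r6")]
print(range_scan(leaf_level, 4, 7))
```

In a real B+-tree the `bisect_left` step is the root-to-leaf descent and the loop follows sibling pointers from leaf to leaf.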
  27. 27. B+-tree <ul><li>Balancing of B+-trees - insert, delete </li></ul><ul><li>nodes usually not full </li></ul><ul><li>utilities to reorganize to lower disk I/O </li></ul><ul><li>most systems allow nodes to become depopulated- no automatic algorithm to balance </li></ul><ul><li>average node below root level 71% full in active growing B+-trees </li></ul>
  28. 28. Inserting into B+ Tree <ul><li>Find correct leaf L. </li></ul><ul><li>Put data entry onto L . </li></ul><ul><ul><li>If L has enough space, done ! </li></ul></ul><ul><ul><li>Else, must split L (into L and a new node L2) </li></ul></ul><ul><ul><ul><li>Redistribute entries evenly, copy up middle key. </li></ul></ul></ul><ul><ul><ul><li>Insert index entry pointing to L2 into parent of L . </li></ul></ul></ul><ul><li>This can happen recursively </li></ul><ul><ul><li>To split index node , redistribute entries evenly, but push up middle key. (Contrast with leaf splits.) </li></ul></ul><ul><li>Splits “grow” tree; root split increases height. </li></ul><ul><ul><li>Tree growth: gets wider or one level taller at top. </li></ul></ul><ul><ul><li>Algorithm from MSU </li></ul></ul>
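The difference between "copy up" (leaf split) and "push up" (index node split) is easiest to see on plain key lists. The sketch below shows only the split step, not a full tree: child pointers and RIDs are omitted, and the function names are made up for illustration.

```python
from bisect import insort


def split_leaf(entries):
    """Split an overfull leaf; the middle key is COPIED up.

    The separator stays in the right-hand leaf because every data entry
    must remain at the leaf level.
    """
    mid = len(entries) // 2
    left, right = entries[:mid], entries[mid:]
    return left, right[0], right


def split_index(keys):
    """Split an overfull index node; the middle key is PUSHED up.

    The separator moves to the parent and appears in neither half.
    """
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]


def insert_into_leaf(leaf, key, d):
    """Insert into a sorted leaf of order d (at most 2*d entries)."""
    insort(leaf, key)
    if len(leaf) <= 2 * d:
        return leaf, None, None          # enough space, done
    return split_leaf(leaf)


print(insert_into_leaf([2, 3, 5, 7], 8, d=2))   # leaf split: 5 is copied up
print(split_index([5, 13, 17, 24, 30]))         # index split: 17 is pushed up
```

When the pushed-up key overflows the parent in turn, the same `split_index` step repeats one level higher, which is the recursion the slide mentions.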
  29. 29. Deleting from B+ tree <ul><li>Start at root, find leaf L where entry belongs. </li></ul><ul><li>Remove the entry. </li></ul><ul><ul><li>If L is at least half-full, done! </li></ul></ul><ul><ul><li>If L has only d-1 entries, </li></ul></ul><ul><ul><ul><li>Try to re-distribute , borrowing from sibling (adjacent node with same parent as L) . </li></ul></ul></ul><ul><ul><ul><li>If re-distribution fails, merge L and sibling. </li></ul></ul></ul><ul><li>If merge occurred, must delete entry (pointing to L or sibling) from parent of L . </li></ul><ul><li>Merge could propagate to root, decreasing height. </li></ul><ul><li>Algorithm from MSU </li></ul>
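A sketch of the leaf-level decision on this slide: after a delete leaves a leaf with d-1 entries, either borrow from a sibling or merge with it. It assumes the sibling is the right-hand neighbour under the same parent and handles key lists only; the function name and return shape are hypothetical.

```python
def fix_underflow(leaf, right_sibling, d):
    """Repair a leaf that dropped to d-1 entries in a tree of order d.

    Returns (action, new_leaf, new_sibling, new_separator). The separator
    is the key the parent should now use between the two leaves, or None
    when the sibling disappears and its parent entry must be deleted.
    """
    if len(right_sibling) > d:
        # Sibling can spare entries: redistribute evenly between the two.
        combined = leaf + right_sibling
        mid = len(combined) // 2
        new_leaf, new_sibling = combined[:mid], combined[mid:]
        return "redistribute", new_leaf, new_sibling, new_sibling[0]
    # Sibling is at minimum occupancy: merge (2d - 1 entries fit in one node).
    return "merge", leaf + right_sibling, [], None


print(fix_underflow([19], [24, 27, 29], d=2))
print(fix_underflow([22], [27, 29], d=2))
```

After a merge the parent loses an entry, so the same check has to be applied to the parent; that is how a merge can propagate to the root.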
  30. 30. Duplicate key values <ul><li>Duplicate key values in index </li></ul><ul><li>leaf nodes have sibling pointers </li></ul><ul><li>but a delete of a row that has a heavily duplicated key entails a long search through the leaf-level of the B+-tree </li></ul><ul><li>Index compression - with multiple duplicates </li></ul><ul><li>| header info | PrX keyval RID RID ... RID | PrX keyval RID…RID| </li></ul><ul><li> where PrX is count of RID values </li></ul>
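The compressed layout on this slide stores each key once, followed by a count and its RIDs. A minimal sketch of that grouping over a sorted list of (key, RID) pairs; the tuple layout (count, key, RIDs) mirrors "PrX keyval RID ... RID", and the function name is illustrative.

```python
def compress_leaf(entries):
    """Group sorted (key, rid) pairs into (count, key, [rid, ...]) blocks."""
    blocks = []                      # each item: [key, list of rids]
    for key, rid in entries:
        if blocks and blocks[-1][0] == key:
            blocks[-1][1].append(rid)        # duplicate key: extend RID list
        else:
            blocks.append([key, [rid]])      # new key value starts a block
    return [(len(rids), key, rids) for key, rids in blocks]


leaf = [("Boston", 11), ("Boston", 12), ("Boston", 40), ("Dallas", 7), ("Miami", 3)]
print(compress_leaf(leaf))
```

The saving grows with the number of duplicates, since the key value is written once per block instead of once per row.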
  31. 31. Create Index <ul><li>Options: </li></ul><ul><ul><li>multiple columns </li></ul></ul><ul><ul><li>tablespace </li></ul></ul><ul><ul><li>storage - initial extents, etc. </li></ul></ul><ul><ul><li>percent free (default = 10) - % of each page left unfilled </li></ul></ul><ul><ul><li>free page (1 free page for every n index pages) </li></ul></ul><ul><li>Can control % of B+-tree node pages left unfilled when index created; this refers to initial creation </li></ul>
  32. 32. Clustering <ul><li>Placing rows on disk in order by some common index key value (remember the index itself is always sorted) </li></ul><ul><li>clustered (clustering) index - index with rows in the same order as the key values </li></ul><ul><li>efficiency advantage: read in a page, get all of the rows with the same value </li></ul><ul><li>clustering is useful for range queries, e.g. between keyval1 and keyval2 </li></ul>
  33. 33. Clustering <ul><li>can only cluster table by 1 clustering index at a time </li></ul><ul><li>In DB2 – </li></ul><ul><ul><li>if the table is empty, rows sorted as placed on disk </li></ul></ul><ul><ul><li>subsequent insertions not clustered, must use REORG </li></ul></ul>
  34. 34. Indexes vs. table scan <ul><li>To illustrate the difference between table scan, secondary index (non clustered) and clustered index </li></ul><ul><li>Assume 10 M customers, 200 cities </li></ul><ul><li>2KB/page, row = 100 bytes, 20 rows/page </li></ul><ul><li>Select * From Customers Where city = 'Birmingham' </li></ul><ul><li>If assume selectivity = 1/200: 1/200 * 10M = 50,000 customers in a city </li></ul>
  35. 35. Table Scan <ul><li>Table Scan - read entire table </li></ul><ul><li>10,000,000/20 = 500,000 pages    </li></ul><ul><li>If use prefetch? </li></ul><ul><li>500000/32 * .? = </li></ul>
  36. 36. Clustering Index <ul><li>Clustering Index – </li></ul><ul><li>All entries for B'ham clustered on same pages </li></ul><ul><li>50,000/20 = 2500 pages (with 20 rows per page)   </li></ul><ul><li>(3 + 50 + 2500)*?= </li></ul>
  37. 37. Secondary Index <ul><li>Secondary Index– </li></ul><ul><li>In the worst case 1 entry for B'ham per page </li></ul><ul><li>50,000 pages (10M/200) </li></ul><ul><li>3 upper nodes of the tree   </li></ul><ul><li>Assume 1000 index entries per leaf node, read 50000/1000 index pages </li></ul><ul><li>(3 + 50 + 50,000)*?= </li></ul>
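The three preceding slides leave the per-I/O cost as a "?". The sketch below plugs in one possible choice, the figures from the "Access time for fast I/O" slide (.0135 s per random page, .060 s per 32-page prefetch), purely as an assumption to make the comparison concrete; the lecture's intended values may differ.

```python
CUSTOMERS = 10_000_000
CITIES = 200
ROWS_PER_PAGE = 20
ENTRIES_PER_LEAF = 1000
UPPER_INDEX_NODES = 3
PREFETCH_PAGES = 32

# Assumed costs, taken from the "Access time for fast I/O" slide.
RANDOM_IO_SECONDS = 0.0135    # one randomly accessed page
PREFETCH_SECONDS = 0.060      # one 32-page sequential prefetch

matching_rows = CUSTOMERS // CITIES                # 50,000 rows for one city
table_pages = CUSTOMERS // ROWS_PER_PAGE           # 500,000 data pages
leaf_pages = matching_rows // ENTRIES_PER_LEAF     # 50 index leaf pages
clustered_pages = matching_rows // ROWS_PER_PAGE   # 2,500 data pages

plans = {
    "table scan, random I/O": table_pages * RANDOM_IO_SECONDS,
    "table scan, prefetch": table_pages / PREFETCH_PAGES * PREFETCH_SECONDS,
    "clustering index": (UPPER_INDEX_NODES + leaf_pages + clustered_pages)
    * RANDOM_IO_SECONDS,
    "secondary index (worst case)": (UPPER_INDEX_NODES + leaf_pages + matching_rows)
    * RANDOM_IO_SECONDS,
}

for plan, seconds in plans.items():
    print(f"{plan:30s} {seconds:10.1f} s")
```

Under these assumed costs the clustering index is the clear winner, and the worst-case secondary index lands in the same range as a prefetched table scan, which is why an optimizer may prefer the scan.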
  38. 38. List Prefetch <ul><li>Create list of data pages to access </li></ul><ul><li>system orders pages to minimize disk I/O </li></ul><ul><ul><li>E.g. elevator algorithm for disk request scheduling </li></ul></ul>
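A minimal sketch of the two ideas on this slide: turning a list of RIDs into an ordered, duplicate-free list of pages, and serving page requests in elevator order. The RIDs are modelled as (page, slot) tuples and the head position is a made-up input; real systems schedule on cylinders rather than page numbers.

```python
def list_prefetch_order(rids):
    """Collect the distinct pages named by the RIDs, in ascending order."""
    return sorted({page for page, _slot in rids})


def elevator_order(pages, head):
    """Serve requests at or above the head position on the way up,
    then the remaining requests on the way back down."""
    upward = sorted(p for p in pages if p >= head)
    downward = sorted((p for p in pages if p < head), reverse=True)
    return upward + downward


rids = [(812, 3), (17, 0), (812, 1), (440, 5), (17, 9)]
print(list_prefetch_order(rids))
print(elevator_order([17, 440, 812, 95], head=100))
```

Sorting first means each page is read once even when several qualifying rows share it, and the arm sweeps the disk instead of jumping back and forth.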
  39. 39. % Free <ul><li>Redo the previous calculations assuming relations created with 50% free option specified. </li></ul>
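One way to set up this exercise is to recount pages with a free-space percentage as a parameter. The sketch below assumes the percentage applies to both data pages and index leaf pages, that the 3 upper index nodes are unchanged, and that the secondary-index worst case stays at one matching row per page; all of these are assumptions, not part of the slides.

```python
CUSTOMERS = 10_000_000
MATCHING_ROWS = 50_000          # customers in one city (10M / 200)
FULL_ROWS_PER_PAGE = 20
FULL_ENTRIES_PER_LEAF = 1000
UPPER_INDEX_NODES = 3


def page_counts(pct_free):
    """Pages read by each access path when pages are left pct_free % empty."""
    rows_per_page = FULL_ROWS_PER_PAGE * (100 - pct_free) // 100
    entries_per_leaf = FULL_ENTRIES_PER_LEAF * (100 - pct_free) // 100
    leaf_pages = MATCHING_ROWS // entries_per_leaf
    return {
        "table scan": CUSTOMERS // rows_per_page,
        "clustering index": UPPER_INDEX_NODES + leaf_pages
        + MATCHING_ROWS // rows_per_page,
        "secondary index": UPPER_INDEX_NODES + leaf_pages + MATCHING_ROWS,
    }


print(page_counts(0))     # reproduces the page counts on the earlier slides
print(page_counts(50))    # same query with 50% free space
```

Under these assumptions the table scan and the clustering index read about twice as many pages, while the worst-case secondary index barely changes because it was already paying one page per row.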
  40. 40. Multiple Indexes <ul><li>More than one index on a relation             </li></ul><ul><ul><li>e.g. class - one index, gender - one index </li></ul></ul>
  41. 41. Composite Index <ul><li>One index based on more than one attribute </li></ul><ul><ul><li>Create Index index_name on Table (col1, col2,... coln) </li></ul></ul><ul><li>Composite index entry - values for each attribute </li></ul><ul><ul><li>e.g. class, gender: entry in index is C1, C2, RID </li></ul></ul><ul><li>What would B+ tree look like? </li></ul>
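One way to picture the leaf level of a composite index on (class, gender) is a list of (C1, C2, RID) entries sorted first by class, then by gender. The sketch below uses made-up sample entries and a hypothetical `prefix_lookup` helper to show which searches that ordering supports.

```python
from bisect import bisect_left


def prefix_lookup(entries, *prefix):
    """Return RIDs of entries whose leading columns equal `prefix`.

    `entries` is a sorted list of (class, gender, rid) tuples. A shorter
    tuple sorts before any longer tuple it is a prefix of, so bisect_left
    lands on the first matching entry.
    """
    i = bisect_left(entries, prefix)
    rids = []
    while i < len(entries) and entries[i][:len(prefix)] == prefix:
        rids.append(entries[i][-1])
        i += 1
    return rids


# Hypothetical entries: (class, gender, RID)
index = sorted([
    (1, "F", "r1"), (1, "M", "r2"), (2, "F", "r3"),
    (2, "M", "r4"), (2, "M", "r5"), (3, "F", "r6"),
])
print(prefix_lookup(index, 2))         # all entries with class = 2
print(prefix_lookup(index, 2, "M"))    # class = 2 and gender = 'M'
```

A search on gender alone gets no help from this ordering, because entries with the same gender are scattered across every class.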
  42. 42. Creating Indexes <ul><li>When determining what indexes to create consider: </li></ul><ul><ul><li>workload - mix of queries and frequencies of requests, e.g. 20% of requests are updates, etc. </li></ul></ul><ul><li>can create lots of indexes but: </li></ul><ul><ul><li>cost to create </li></ul></ul><ul><ul><li>insertions </li></ul></ul><ul><ul><li>initial load time high if a large table </li></ul></ul><ul><ul><li>index entries can become longer and longer as multiple columns included </li></ul></ul>