Select Operation- disk access and Indexing



    1. 1. <ul><li>Select Operation- disk access </li></ul><ul><li>and Indexing </li></ul><ul><li>*Some info on slides from Dr. S. Son, U. Va </li></ul>
    2. 2. Disk access <ul><li>DBs traditionally stored on disk </li></ul><ul><li>Cheaper to store on disk than in memory </li></ul><ul><li>Costs for: </li></ul><ul><ul><li>Seek time, latency, data transfer time </li></ul></ul><ul><li>  Disk access is page (block) oriented </li></ul><ul><li>2 - 4 KB page size </li></ul>
    3. 3. Access time <ul><li>Access time is the time to randomly access a page </li></ul><ul><li>System initially determines if page in memory buffer (page tables, etc.) </li></ul><ul><li>Large disparity between disk access and memory access </li></ul>
    4. 4. Select operation using table scan <ul><li>If read the entire table for a select – table scan </li></ul><ul><li>Improvements to table scan of disk: </li></ul><ul><ul><li>Parallel access </li></ul></ul><ul><ul><li>Sequential prefetch </li></ul></ul>
    5. 5. Parallel access <ul><li>Linear search - all data rows read in from disk </li></ul><ul><ul><li>I/O parallelism can be used (RAID) </li></ul></ul><ul><ul><ul><li>multiple I/O read requests satisfied at the same time </li></ul></ul></ul><ul><ul><ul><li>stripe the data across different disks </li></ul></ul></ul><ul><ul><li>Problems with parallelism? </li></ul></ul><ul><ul><ul><li>must balance disk arm load to gain maximum parallelism </li></ul></ul></ul><ul><ul><ul><li>requires the same total number of random I/Os, but uses each device for a shorter time </li></ul></ul></ul>
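A minimal sketch of the striping idea on the slide above, assuming simple round-robin placement of pages across disks. The function name `stripe` and the choice of 4 disks and 12 pages are illustrative assumptions, not taken from the slides.

```python
def stripe(page_numbers, num_disks):
    """Assign pages to disks round-robin (a simplified view of striping)."""
    disks = [[] for _ in range(num_disks)]
    for page in page_numbers:
        disks[page % num_disks].append(page)
    return disks


# Hypothetical table of 12 pages spread over 4 disks.
layout = stripe(range(12), num_disks=4)
for disk_id, pages in enumerate(layout):
    print(f"disk {disk_id}: pages {pages}")
```

With an even spread, each disk serves a quarter of the page reads, so the same total number of I/Os finishes in roughly a quarter of the elapsed time, provided the load stays balanced.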
    6. 6. Sequential prefetch I/O <ul><li>Retrieve one disk page after another (on same track) - typically 32 pages </li></ul><ul><li>Seek time no longer a problem </li></ul><ul><li>Must know in advance to read 32 successive pages </li></ul><ul><li>Speed-up of I/O by a factor of ≈ 7 (500 I/Os per second vs. 70) </li></ul>
    7. 7. Access time <ul><li>Seek time – average 8-10 ms, as low as 4 ms on server disks </li></ul><ul><li>Latency time – 2-4 ms, as low as 1 ms or less </li></ul><ul><li>Data transfer time – 0.4-2 ms </li></ul>
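    8. 8. Access time <ul><li>Times in seconds: random I/O (RIO, 1 page) vs. sequential prefetch (32 pages) </li></ul><ul><ul><li>Seek (disk arm to cylinder): RIO .010, seq. prefetch .010 </li></ul></ul><ul><ul><li>Latency (platter to sector): RIO .002, seq. prefetch .002 </li></ul></ul><ul><ul><li>Data transfer: RIO .0015 (1 page), seq. prefetch .048 (32 pages) </li></ul></ul><ul><ul><li>Total: RIO .0135, seq. prefetch .060 </li></ul></ul><ul><li>For 32 pages: .43* seconds with RIO vs. .060 seconds with sequential prefetch </li></ul><ul><li>* .0135 × 32 = .432 </li></ul>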
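    9. 9. Access time for fast I/O <ul><li>Times in seconds: random I/O (RIO, 1 page) vs. sequential prefetch (32 pages) </li></ul><ul><ul><li>Seek (disk arm to cylinder): RIO .004, seq. prefetch .004 </li></ul></ul><ul><ul><li>Latency (platter to sector): RIO .001, seq. prefetch .001 </li></ul></ul><ul><ul><li>Data transfer: RIO .0005 (1 page), seq. prefetch .016 (32 pages) </li></ul></ul><ul><ul><li>Total: RIO .0055, seq. prefetch .021 </li></ul></ul><ul><li>For 32 pages: .176* seconds with RIO vs. .021 seconds with sequential prefetch </li></ul><ul><li>* .0055 × 32 = .176 </li></ul>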
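The arithmetic on the two access-time slides can be checked with a short sketch. It assumes the model the slides imply: random I/O pays seek, latency, and transfer for every page, while sequential prefetch pays seek and latency once and then transfers all 32 pages. The function and variable names are illustrative.

```python
PAGES = 32


def random_io_time(seek, latency, transfer_per_page, pages=PAGES):
    # Every page pays its own seek + latency + transfer.
    return pages * (seek + latency + transfer_per_page)


def prefetch_time(seek, latency, transfer_per_page, pages=PAGES):
    # One seek and one latency, then the pages are transferred back to back.
    return seek + latency + pages * transfer_per_page


# Per-page figures in seconds, taken from the two slides.
typical = dict(seek=0.010, latency=0.002, transfer_per_page=0.0015)
fast = dict(seek=0.004, latency=0.001, transfer_per_page=0.0005)

for name, disk in (("typical", typical), ("fast", fast)):
    rio = random_io_time(**disk)
    seq = prefetch_time(**disk)
    print(f"{name}: random {rio:.3f} s, prefetch {seq:.3f} s, "
          f"speed-up {rio / seq:.1f}x")
```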
    10. 10. Organizing disk space <ul><li>How should data be stored to minimize access time when the entire table is read? </li></ul>
    11. 11. Disk allocation <ul><li>Disk Resource Allocation for Databases (DBA has control) </li></ul><ul><li>Goal – contiguous sectors on disk - want data as close together as possible  to minimize seek time </li></ul><ul><li>No standard SQL approach, but general way to deal with allocation </li></ul><ul><li>Some OS allow specification of size of file and disk device </li></ul>
    12. 12. Types of Files <ul><li>Heap files (unordered – sequential) </li></ul><ul><li>Sorted files (ordered – sort key) </li></ul><ul><li>Hash files (hash key, hash function) </li></ul><ul><ul><li>Internal, external, file expansion </li></ul></ul><ul><ul><li>B+-trees </li></ul></ul><ul><li>RAID technology (parallelizing) </li></ul><ul><li>Storage area networks – ERP (enterprise resource planning) and DW (data warehouses) </li></ul><ul><ul><li>Storage devices configured as nodes in network – can attach/detach </li></ul></ul>
    13. 13. Tablespace <ul><li>Tablespace is: </li></ul><ul><li>Allocation medium for tables and indexes for ORACLE, DB2, etc. </li></ul><ul><li>Can put >1 table in a table space if accessed together </li></ul><ul><li>Tablespace corresponds to 1 or more OS files and can span disk devices </li></ul><ul><li>Usually relations cannot span disk devices </li></ul>
    14. 14. DB storage structures <ul><li>Database: CAP </li></ul><ul><li>Tablespaces: tspace1, system </li></ul><ul><li>OS files: fname1, fname2, fname3 </li></ul><ul><li>Tables: Cust, agents, prods, orders, orindx </li></ul><ul><li>Segments: data, data, data, data, index </li></ul><ul><li>Extents </li></ul>
    15. 15. Tablespace <ul><li>ORACLE DB's contain several tablespaces, including one called system -      data description +  indexes + user-defined tables </li></ul><ul><li>default tablespace given to each user </li></ul><ul><li>if multiple tablespaces - better control over load balancing </li></ul><ul><li>can take some disk space off-line </li></ul>
    16. 16. Extent <ul><li>Relation composed of 1 or more extents </li></ul><ul><li>Extent - contiguous storage on disk </li></ul><ul><li>when a data segment or index segment is first created, it is given an initial extent of 10 KB (5 pages) from the tablespace </li></ul><ul><li>if more space is needed, it is given the next contiguous extent </li></ul>
    17. 17. Extent <ul><li>Can increase the size by a positive % (cannot decrease) </li></ul><ul><ul><li>initial n - size of initial extent </li></ul></ul><ul><ul><li>next n - size of next </li></ul></ul><ul><ul><li>max extents - maximum number of extents </li></ul></ul><ul><ul><li>min extents - number of extents initially allocated </li></ul></ul><ul><ul><li>pct increase n - % by which next extent grows over previous one </li></ul></ul>
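A rough sketch of how the `initial`, `next`, and `pct increase` parameters interact, assuming each extent after the second grows by the given percentage over the previous one. It ignores any rounding to whole pages that a real DBMS would apply, and the sample values (10 KB, 50%) are hypothetical.

```python
def extent_sizes(initial_kb, next_kb, pct_increase, count):
    """Return the sizes of the first `count` extents of a segment."""
    sizes = [initial_kb]
    size = next_kb
    while len(sizes) < count:
        sizes.append(size)
        # Each further extent grows by pct_increase over the previous one.
        size = size * (1 + pct_increase / 100)
    return sizes


sizes = extent_sizes(initial_kb=10, next_kb=10, pct_increase=50, count=5)
print(sizes)
```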
    18. 18. Oracle create tablespace <ul><li> </li></ul>
    19. 19. Create table <ul><li>Create table statement - can specify tablespace, no. of extents </li></ul><ul><ul><li>When initial extent full, new extent allocated </li></ul></ul><ul><ul><li>pctfree - percentage of each page kept free, which limits how much of the page inserts of new rows can fill </li></ul></ul><ul><ul><ul><li>if pctfree = 10%, inserts stop when page is 90% full </li></ul></ul></ul><ul><ul><ul><ul><ul><li>Uses another page </li></ul></ul></ul></ul></ul><ul><ul><li>pctused – determines when new inserts start again </li></ul></ul><ul><ul><ul><li>inserts resume when the used space falls below this percentage of the page; default pctused = 40% </li></ul></ul></ul><ul><ul><ul><li>pctfree + pctused < 100 </li></ul></ul></ul>
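The interplay of pctfree and pctused can be pictured as a small state machine. This is only a sketch of the behaviour the slide describes, not Oracle's implementation; the class `Page` and its method names are hypothetical.

```python
class Page:
    """Tracks whether a page currently accepts inserts of new rows."""

    def __init__(self, pctfree=10, pctused=40):
        assert pctfree + pctused < 100
        self.pctfree = pctfree
        self.pctused = pctused
        self.fill = 0                 # percent of the page in use
        self.accepts_inserts = True

    def change_fill(self, delta):
        self.fill += delta
        if self.fill >= 100 - self.pctfree:
            # Page reached its insert limit: stop inserting here.
            self.accepts_inserts = False
        elif self.fill < self.pctused:
            # Enough rows were deleted: inserts may start again.
            self.accepts_inserts = True
        # Between the two thresholds the previous state is kept.


page = Page()
page.change_fill(+90)   # 90% full: inserts stop
print(page.accepts_inserts)
page.change_fill(-30)   # 60% full: still closed, above pctused
print(page.accepts_inserts)
page.change_fill(-25)   # 35% full: below pctused, inserts resume
print(page.accepts_inserts)
```

Keeping a gap between the two thresholds avoids a page flipping between open and closed on every small insert or delete.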
    20. 20. Rows <ul><li>Row layout on each disk page: </li></ul><ul><li>header info | row directory (slots 1, 2, 3, … N) | free space | data rows (Row N, Row N-1, … Row 1) </li></ul><ul><li>Row directory – row number and page byte offset </li></ul><ul><ul><li>Row number is row number in page – book calls it slot# </li></ul></ul><ul><ul><ul><li>Page byte offset – with varchar, row size not constant </li></ul></ul></ul><ul><li>To identify a particular row use RID (RowID): page #, slot # [file#] </li></ul><ul><li>slot# is number in row directory (logical #) </li></ul>
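A minimal sketch of the page layout described above: a row directory maps slot numbers to byte offsets, rows are stored from the end of the page backward, and a RID is the pair (page #, slot #). The class `SlottedPage` is hypothetical, and the sketch ignores the header, the space the directory itself occupies, and overflow checks.

```python
class SlottedPage:
    def __init__(self, page_no, size=2048):
        self.page_no = page_no
        self.data = bytearray(size)
        self.directory = []      # slot# - 1 -> (byte offset, row length)
        self.free_end = size     # rows are placed from the end of the page

    def insert(self, row: bytes):
        """Store a row and return its RID as (page#, slot#)."""
        self.free_end -= len(row)
        self.data[self.free_end:self.free_end + len(row)] = row
        self.directory.append((self.free_end, len(row)))
        return (self.page_no, len(self.directory))   # slots are 1-based

    def fetch(self, rid):
        """Follow a RID through the row directory to the row bytes."""
        _, slot = rid
        offset, length = self.directory[slot - 1]
        return bytes(self.data[offset:offset + length])


page = SlottedPage(page_no=7)
rid1 = page.insert(b"alice")
rid2 = page.insert(b"a-longer-varchar-row")
print(rid2, page.fetch(rid1))
```

Because the RID names a slot rather than a byte offset, a variable-length row can move within the page without invalidating pointers held in indexes.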
    21. 21. Differences in DBMSs re: rows <ul><li>ROWID can be retrieved in ORACLE but not DB2 (violates relational model rule) </li></ul><ul><li>ORACLE </li></ul><ul><ul><ul><li>rows can be split between pages (row record fragmentation) </li></ul></ul></ul><ul><ul><ul><li>Can have rows from multiple tables on same page </li></ul></ul></ul><ul><li>DB2: no splitting; entire row moved to new page, needs forwarding pointer </li></ul>
    22. 22. Select operation using Indexes <ul><li>Alternative to table scan </li></ul>
    23. 23. Binary Search <ul><li>“Find all students with gpa > 3.0” </li></ul><ul><ul><li>If data is in sorted file, do binary search to find first such student, then scan to find others. </li></ul></ul><ul><ul><li>Cost of binary search can be quite high. </li></ul></ul>
    24. 24. Binary Search <ul><li>Binary search on disk </li></ul><ul><ul><li>optimal for comparisons - not optimal for disk-based look-up </li></ul></ul><ul><ul><li>must keep data in order </li></ul></ul><ul><ul><li>may be reading values from same page at different times </li></ul></ul>
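The "binary search, then scan" approach can be sketched in memory with Python's `bisect` module. The student records are made-up sample data, and the sketch says nothing about disk cost, which is the slides' real concern: on disk, each probe of the binary search may touch a different page.

```python
from bisect import bisect_right

# Hypothetical sorted file: records ordered by gpa.
students = sorted(
    [("Ann", 2.7), ("Bob", 3.0), ("Cy", 3.4), ("Di", 3.9), ("Ed", 2.1)],
    key=lambda s: s[1],
)
gpas = [gpa for _, gpa in students]

# Binary search for the first record with gpa > 3.0 ...
first = bisect_right(gpas, 3.0)
# ... then scan forward from there to collect the rest.
matches = students[first:]
print(matches)
```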
    25. 25. Indexing <ul><li>Instead: create an 'index' file </li></ul><ul><li>Keyed access retrieval method </li></ul><ul><li>index is a sorted file - sorted by index key </li></ul><ul><li>index entries: </li></ul><ul><ul><ul><li>index key, pointer (RID) </li></ul></ul></ul><ul><li>index resides on disk, partially memory resident when accessed </li></ul>
    26. 26. Index File <ul><li>[Figure: an index file with keys k1, k2, … kN pointing to Page 1, Page 2, Page 3, … Page N of the data file] </li></ul>
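A small sketch of an index file as described above: a sorted list of (index key, RID) entries kept separately from an unordered data file. The data rows, the names, and the `lookup` function are illustrative assumptions; a real index lives on disk pages rather than in a Python list.

```python
from bisect import bisect_left

# Hypothetical heap data file: page# -> rows stored in arrival order.
data_file = {
    1: [{"ssn": 333445555, "name": "Wong"}, {"ssn": 123456789, "name": "Smith"}],
    2: [{"ssn": 999887777, "name": "Zelaya"}, {"ssn": 453453453, "name": "English"}],
}

# Index entries are (key, RID) pairs sorted by key; RID = (page#, slot#).
index = sorted(
    (row["ssn"], (page_no, slot))
    for page_no, rows in data_file.items()
    for slot, row in enumerate(rows, start=1)
)
keys = [key for key, _ in index]


def lookup(ssn):
    """Search the sorted index, then follow the RID into the data file."""
    i = bisect_left(keys, ssn)
    if i == len(keys) or keys[i] != ssn:
        return None
    page_no, slot = index[i][1]
    return data_file[page_no][slot - 1]


print(lookup(333445555))
```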
    27. 27. Tree-based index <ul><li>B-tree – balanced tree </li></ul><ul><li>Nodes point to data (RIDs) and also point to other nodes in tree </li></ul>
    28. 28. B+-tree <ul><li>Most commonly used index structure type in DBs today </li></ul><ul><li>Based on B-tree </li></ul><ul><li>Good for equality and range searches </li></ul><ul><li>B+ tree : dynamic, adjusts gracefully under inserts and deletes. </li></ul><ul><li>Used to minimize disk I/O </li></ul><ul><li>available in DB2, ORACLE also has hash cluster, Ingres has heap structure, B-tree, isam (chain together new nodes) </li></ul>
    29. 29. Structure of B+ Trees <ul><li>leaf level pointers to data (RIDs) </li></ul><ul><li>the remaining are directory (index) nodes that point to other index nodes </li></ul><ul><li>[Figure: index entries (direct search) above data entries (&quot;sequence set&quot;)] </li></ul>
    30. 30. Characteristics of B+ Tree <ul><li>Insert/delete at log_F N cost; keep tree height-balanced. (F = fanout, N = # leaf pages) </li></ul><ul><li>Minimum 50% occupancy (except for root). Each node contains d <= m <= 2d entries. The parameter d is called the order of the tree. </li></ul><ul><li>Supports equality and range searches efficiently </li></ul>
    31. 31. Cost of I/O for B+-tree <ul><li>Assume number of entries in each index node fits on one page - one node is one page </li></ul><ul><li>If tree has a depth of 3, 3 I/Os to get pointer to data </li></ul><ul><li>B+-tree structured to get most out of every disk page read </li></ul><ul><li>Once an index node is read in, can make multiple probes to same page if it remains in memory </li></ul><ul><ul><li>likely since frequent access to upper-level nodes of actively used B+-trees </li></ul></ul>
    32. 32. B+ Trees in Practice <ul><li>Typical order: 100. </li></ul><ul><li>Typical fill-factor: 2/3 full (66.6%) </li></ul><ul><ul><li>average fanout = 133 </li></ul></ul><ul><li>Typical capacities: </li></ul><ul><ul><li>Height 4: 133^4 = 312,900,721 records </li></ul></ul><ul><ul><li>Height 3: 133^3 = 2,352,637 records </li></ul></ul><ul><li>Can often hold top levels in buffer pool: </li></ul><ul><ul><li>Level 1 = 1 page = 8 Kbytes </li></ul></ul><ul><ul><li>Level 2 = 133 pages ≈ 1 Mbyte </li></ul></ul><ul><ul><li>Level 3 = 17,689 pages ≈ 138 MBytes </li></ul></ul>
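The capacity figures follow from the order and fill factor. A quick check, assuming fanout = 2 × order × fill factor and 8 KB pages as on the slide (the constant names are illustrative):

```python
ORDER = 100          # d: a node holds between d and 2d entries
FILL_FACTOR = 2 / 3  # typical node occupancy
PAGE_KB = 8

fanout = round(2 * ORDER * FILL_FACTOR)
print("average fanout:", fanout)

for height in (3, 4):
    print(f"height {height}: {fanout ** height:,} records")

for level in (1, 2, 3):
    pages = fanout ** (level - 1)
    print(f"level {level}: {pages:,} pages, about {pages * PAGE_KB / 1024:.1f} MB")
```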
    33. 33. B+-tree <ul><li>B+ tree has a directory structure that allows retrieval of a range of values efficiently </li></ul><ul><ul><li>search for leftmost index entry S_i such that X <= S_i </li></ul></ul><ul><li>Index entries always placed in sequence by value - can use sequential prefetch on index </li></ul><ul><li>Index entries shorter than data rows and require proportionately less I/O </li></ul>
    34. 34. B+-tree <ul><li>Balancing of B+-trees - insert, delete </li></ul><ul><li>Nodes usually not full </li></ul><ul><li>utilities to reorganize to lower disk I/O </li></ul><ul><li>Most systems allow nodes to become depopulated- no automatic algorithm to balance </li></ul><ul><li>Average node below root level 71% full in active growing B+-trees </li></ul><ul><li>Insert/delete </li></ul>
    35. 35. Inserting into B+ Tree <ul><li>Find correct leaf L. </li></ul><ul><li>Put data entry onto L . </li></ul><ul><ul><li>If L has enough space, done ! </li></ul></ul><ul><ul><li>Else, must split L (into L and a new node L2) </li></ul></ul><ul><ul><ul><li>Redistribute entries evenly, copy up middle key. </li></ul></ul></ul><ul><ul><ul><li>Insert index entry pointing to L2 into parent of L . </li></ul></ul></ul><ul><li>This can happen recursively </li></ul><ul><ul><li>To split index node , redistribute entries evenly, but push up middle key. (Contrast with leaf splits.) </li></ul></ul><ul><li>Splits “grow” tree; root split increases height. </li></ul><ul><ul><li>Tree growth: gets wider or one level taller at top. </li></ul></ul><ul><ul><li>Algorithm </li></ul></ul>
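The contrast between leaf splits (copy up) and index-node splits (push up) can be sketched on plain key lists. This is a simplification under stated assumptions: real nodes also carry RIDs or child pointers, which the sketch omits, and the function names are hypothetical.

```python
def split_leaf(entries):
    """Split an overfull leaf. The middle key is COPIED up:
    it stays in the new right leaf and also goes to the parent."""
    mid = len(entries) // 2
    left, right = entries[:mid], entries[mid:]
    return left, right, right[0]


def split_index_node(keys):
    """Split an overfull index node. The middle key is PUSHED up:
    it moves to the parent and appears in neither half."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid + 1:], keys[mid]


# Order d = 2, so a node overflows when it reaches 5 keys.
print(split_leaf([2, 3, 5, 7, 8]))
print(split_index_node([5, 13, 17, 24, 30]))
```

The leaf keeps its copy because every data entry must remain reachable at the leaf level, while a key in an index node only directs the search and need not be duplicated.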
    36. 36. Deleting from B+ tree <ul><li>Start at root, find leaf L where entry belongs. </li></ul><ul><li>Remove the entry. </li></ul><ul><ul><li>If L is at least half-full, done! </li></ul></ul><ul><ul><li>If L has only d-1 entries, </li></ul></ul><ul><ul><ul><li>Try to re-distribute , borrowing from sibling (adjacent node with same parent as L) . </li></ul></ul></ul><ul><ul><ul><li>If re-distribution fails, merge L and sibling. </li></ul></ul></ul><ul><li>If merge occurred, must delete entry (pointing to L or sibling) from parent of L . </li></ul><ul><li>Merge could propagate to root, decreasing height. </li></ul><ul><li>Algorithm </li></ul>
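The redistribute-or-merge decision for an underfull leaf can be sketched as follows. It assumes the leaf and its sibling are given as sorted key lists and leaves the parent update to the caller; `fix_underflow` is a hypothetical helper, not part of any DBMS.

```python
def fix_underflow(leaf, sibling, d):
    """Repair a leaf holding d-1 entries using an adjacent sibling.

    Returns ("redistribute", left, right) or ("merge", merged, None).
    """
    combined = sorted(leaf + sibling)
    if len(sibling) > d:
        # Sibling can spare entries: share them evenly between the two.
        mid = len(combined) // 2
        return "redistribute", combined[:mid], combined[mid:]
    # Sibling is at the minimum: merge, and the parent loses an entry.
    return "merge", combined, None


print(fix_underflow([22], [24, 27, 29], d=2))
print(fix_underflow([22], [27, 29], d=2))
```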
    37. 37. Duplicate key values <ul><li>Duplicate key values in index </li></ul><ul><li>leaf nodes have sibling pointers </li></ul><ul><li>but a delete of a row that has a heavily duplicated key entails a long search through the leaf-level of the B+-tree </li></ul><ul><li>Index compression - with multiple duplicates </li></ul><ul><li>| header info | PrX keyval RID RID ... RID | PrX keyval RID…RID| </li></ul><ul><li> where PrX is count of RID values </li></ul>
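The compressed layout `PrX keyval RID RID ... RID` amounts to grouping sorted index entries by key. A sketch, assuming entries are (key, RID) pairs already in key order; the city names and RIDs are made-up sample data.

```python
from itertools import groupby


def compress(sorted_entries):
    """Turn (key, RID) pairs into (count, key, [RID, ...]) groups."""
    compressed = []
    for key, group in groupby(sorted_entries, key=lambda entry: entry[0]):
        rids = [rid for _, rid in group]
        compressed.append((len(rids), key, rids))
    return compressed


entries = [
    ("Boston", (1, 3)), ("Boston", (4, 1)), ("Boston", (9, 2)),
    ("Dallas", (2, 5)), ("Dallas", (7, 7)),
]
print(compress(entries))
```

Each key value is stored once per group instead of once per row, which shortens the leaf level when keys are heavily duplicated.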
    38. 38. Create Index <ul><li>Options: </li></ul><ul><ul><li>multiple columns </li></ul></ul><ul><ul><li>tablespace </li></ul></ul><ul><ul><li>storage - initial extents, etc. </li></ul></ul><ul><ul><li>percent free (default = 10) - % of each page left unfilled </li></ul></ul><ul><ul><li>free page (1 free page for every n index pages) </li></ul></ul><ul><li>Can control % of B+-tree node pages left unfilled when index created, refers to initial creation </li></ul>
    39. 39. Why use an index? <ul><li>If a select (or join) is used on the same attribute frequently </li></ul><ul><li>want a way to improve performance - use indexes </li></ul><ul><ul><li>For example: </li></ul></ul><ul><ul><ul><ul><li>Select * from Employee </li></ul></ul></ul></ul><ul><ul><ul><ul><li>where ssn = 333445555 </li></ul></ul></ul></ul><ul><li>Instead of reading the entire file until ssn is found, it would be nice if we had a pointer to that employee </li></ul>
    40. 40. Types of indexes (textbook) <ul><li>Primary index - key field is a candidate key (must be unique) – data file ordered by key field </li></ul><ul><li>Clustering index - key field is not unique, data file is ordered – all records with same values on same pages </li></ul><ul><li>Secondary index - non-clustering index – data file not ordered </li></ul><ul><ul><li>First record in the data page (or block) is called the anchor record </li></ul></ul><ul><ul><ul><li>Non-dense index - pointer in index entry points to anchor </li></ul></ul></ul><ul><ul><ul><li>Dense index - pointer to every record in the file </li></ul></ul></ul>
    41. 41. Non-clustered indexes <ul><li>Non-clustered index (secondary index) </li></ul><ul><ul><li>key field is a non ordering field - it is not used to physically order the data file </li></ul></ul><ul><ul><li>the index itself is still ordered </li></ul></ul><ul><ul><li>How many non-clustering indexes can a table have? </li></ul></ul>
    42. 42. Clustered Indexes <ul><li>Placing rows on disk in order by some common index key value        (remember the index itself is always sorted) </li></ul><ul><ul><li>Clustered index - (primary and clustering) </li></ul></ul><ul><ul><li>key field is an ordering field - all the data with the same values for the key field physically placed on the same pages on the disk. </li></ul></ul><ul><ul><li>If primary key, data ordered on a page by key field </li></ul></ul><ul><ul><li>Usually assume disk pages themselves also clustered on the disk </li></ul></ul><ul><ul><li>How many clustering indexes can a table have? </li></ul></ul>
    43. 43. Clustering <ul><li>Efficiency advantage: read in a page, get all of the rows with the same value </li></ul><ul><li>clustering is useful for range queries, e.g. between keyval1 and keyval2 </li></ul>
    44. 44. Example <ul><li> </li></ul>
    45. 45. Clustering <ul><li>Can only cluster table by 1 clustering index at a time </li></ul><ul><li>In SQL Server </li></ul><ul><ul><li>creates clustered index on PK automatically if there is no other clustered index on the table and a nonclustered PK index is not specified </li></ul></ul><ul><li>In DB2 – </li></ul><ul><ul><li>if the table is empty, rows sorted as placed on disk </li></ul></ul><ul><ul><li>subsequent insertions not clustered, must use REORG </li></ul></ul><ul><li>In Oracle – </li></ul><ul><ul><li>Cluster index – now available for PK in 10g </li></ul></ul><ul><ul><li>Define a cluster to create cluster index for 2 tables </li></ul></ul>
    46. 46. Indexes vs. table scan <ul><li>To illustrate the difference between table scan, secondary index (non-clustered) and clustered index </li></ul><ul><ul><li>Assume 10 M customers, 200 cities </li></ul></ul><ul><ul><li>2KB/page, row = 100 bytes, 20 rows/page </li></ul></ul><ul><ul><li>Select * From Customers Where city = 'Birmingham' </li></ul></ul><ul><ul><li>If assume selectivity = 1/200: 1/200 * 10M = 50,000 customers in a city </li></ul></ul>
    47. 47. Rules of Thumb for I/O <ul><li>Random I/O (R) – 160 pages/second, .00625 seconds/page </li></ul><ul><li>Sequential prefetch I/O (S) – 1600 pages/second, .000625 seconds/page </li></ul><ul><li>Will discuss later: </li></ul><ul><li>List prefetch I/O (L) – 400 pages/second, .0025 seconds/page </li></ul>
    48. 48. Table Scan <ul><li>Table Scan - read entire table </li></ul><ul><li>If random I/O is used: </li></ul><ul><li>10,000,000/20 = 500,000 pages </li></ul><ul><li>500,000*R = 3125 seconds </li></ul><ul><li>Instead, it makes more sense to use: </li></ul><ul><li>sequential prefetch, reading 32 pages at a time </li></ul><ul><li>500,000*S = 312.5 seconds </li></ul>
    49. 49. Clustering Index <ul><li>Clustering Index – </li></ul><ul><li>All entries for B'ham clustered on same pages </li></ul><ul><li>50,000/20 = 2500 pages (with 20 rows per page) </li></ul><ul><li>Assume: 3 upper nodes of the tree </li></ul><ul><li>Assume: 1000 index entries per leaf node, read 50,000/1000 = 50 index pages </li></ul><ul><li>(3 + 50 + 2500) * ? = 2,553 * ? </li></ul><ul><li>If assume ?=R, then 2,553*R ≈ 16 seconds </li></ul><ul><li>Makes more sense to assume (3+50+2500) * S ≈ 1.6 seconds </li></ul>
    50. 50. Secondary Index <ul><li>Secondary Index – </li></ul><ul><li>In the worst case 1 entry for B'ham per page </li></ul><ul><li>50,000 pages (10M/200) </li></ul><ul><li>3 upper nodes of the tree </li></ul><ul><li>Assume 1000 index entries per leaf node, read 50,000/1000 = 50 index pages </li></ul><ul><li>(3 + 50 + 50,000) * ? = 50,053 * ? </li></ul><ul><li>If assume ?=R, then 50,053*R = 312.8 seconds </li></ul><ul><li>Better to assume (3+50)*S + 50,000*R = 312.53 seconds </li></ul>
    51. 51. List Prefetch <ul><li>Create list of data pages to access </li></ul><ul><li>Pages not necessarily in contiguous sequential order </li></ul><ul><li>system orders pages to minimize disk I/O </li></ul><ul><ul><li>E.g. elevator algorithm for disk request scheduling </li></ul></ul><ul><li>50,053 * L = 125.1 seconds </li></ul><ul><li>Best to assume (3+50)*S + 50,000*L = 125.03 seconds </li></ul>
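The cost estimates on the preceding slides (table scan, clustered index, secondary index, list prefetch) can be reproduced in one place. The sketch uses the slides' rules of thumb for R, S, and L and the slides' assumptions about the table; the variable names and the dictionary layout are illustrative.

```python
# Seconds per page, from the rules of thumb.
R = 1 / 160     # random I/O
S = 1 / 1600    # sequential prefetch I/O
L = 1 / 400     # list prefetch I/O

ROWS = 10_000_000
ROWS_PER_PAGE = 20
CITIES = 200
ENTRIES_PER_LEAF = 1000
UPPER_NODES = 3

table_pages = ROWS // ROWS_PER_PAGE              # 500,000 data pages
matching_rows = ROWS // CITIES                   # 50,000 rows for one city
leaf_pages = matching_rows // ENTRIES_PER_LEAF   # 50 index leaf pages
index_pages = UPPER_NODES + leaf_pages
clustered_data_pages = matching_rows // ROWS_PER_PAGE   # 2,500 pages
scattered_data_pages = matching_rows             # worst case: 1 row per page

costs = {
    "table scan, random I/O": table_pages * R,
    "table scan, sequential prefetch": table_pages * S,
    "clustered index, sequential prefetch": (index_pages + clustered_data_pages) * S,
    "secondary index, random data reads": index_pages * S + scattered_data_pages * R,
    "secondary index, list prefetch": index_pages * S + scattered_data_pages * L,
}
for plan, seconds in costs.items():
    print(f"{plan}: {seconds:,.2f} s")
```

In the worst case the secondary index with random data reads costs about the same as a prefetched table scan, which is why the clustered index and list prefetch matter for a predicate this unselective.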
    52. 52. % Free <ul><li>Redo the previous calculations assuming relations created with 50% free option specified. </li></ul>
    53. 53. Creating Indexes <ul><li>When determining what indexes to create, consider the workload - mix of queries and frequencies of requests, e.g. 20% of requests are updates </li></ul><ul><li>can create lots of indexes, but there are costs: </li></ul><ul><ul><li>cost to create </li></ul></ul><ul><ul><li>insertions </li></ul></ul><ul><ul><li>initial load time high if a large table </li></ul></ul><ul><ul><li>index entries can become longer and longer as multiple columns included </li></ul></ul>
    54. 54. Multiple Indexes <ul><li>More than one index on a relation             </li></ul><ul><ul><li>e.g. class - one index, gender - one index </li></ul></ul>
    55. 55. Composite Index <ul><li>One index based on more than one attribute </li></ul><ul><ul><li>Create Index index_name on Table (col1, col2,... coln) </li></ul></ul><ul><li>Composite index entry - values for each attribute </li></ul><ul><ul><li>e.g. for (class, gender), entry in index is: C1, C2, RID </li></ul></ul><ul><li>What would B+ tree look like? </li></ul>
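One way to picture the leaf level of a composite index on (class, gender) is a list of (C1, C2, RID) entries sorted on both columns. The sketch below uses made-up RIDs and a hypothetical `find` helper; it shows that the same sorted order serves lookups on both columns and on the leading column alone.

```python
from bisect import bisect_left, bisect_right

# Entries (class, gender, RID) sorted by class, then gender.
index = sorted([
    (1, "F", (4, 2)), (2, "M", (1, 1)), (1, "M", (2, 3)),
    (3, "F", (3, 1)), (2, "F", (5, 4)), (3, "M", (1, 2)),
])
pairs = [(cls, gender) for cls, gender, _ in index]
classes = [cls for cls, _, _ in index]


def find(cls, gender=None):
    """Return RIDs matching class, or class and gender."""
    if gender is None:
        lo, hi = bisect_left(classes, cls), bisect_right(classes, cls)
    else:
        lo = bisect_left(pairs, (cls, gender))
        hi = bisect_right(pairs, (cls, gender))
    return [rid for _, _, rid in index[lo:hi]]


print(find(2))        # all of class 2, adjacent in the index
print(find(3, "F"))   # class 3, gender F
```

A search on gender alone gets no help from this ordering, since entries with the same gender are scattered across all classes.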
    56. 56. Threads <ul><li>Thread results from a fork of a computer program, usually contained inside a process </li></ul><ul><ul><li>Multiple threads inside the same process share resources, address space and memory </li></ul></ul><ul><ul><li>Processes do not share these resources </li></ul></ul><ul><ul><li>Threads have their own stack, copy of registers, PC and thread-local storage </li></ul></ul><ul><li>Some languages support multiple threads, but do not execute them at the same time </li></ul><ul><ul><li>Kernel threads can run concurrently </li></ul></ul>
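A short sketch of threads sharing memory inside one process, using Python's standard `threading` module. The counter, the lock, and the thread count are illustrative choices. Python also happens to illustrate the last point on the slide: in the standard CPython build, threads of one process generally do not execute Python bytecode at the same time.

```python
import threading

counter = 0                  # shared by all threads in the process
lock = threading.Lock()      # guards the shared counter


def add(n):
    global counter
    for _ in range(n):
        # Without the lock, concurrent read-modify-write could lose updates.
        with lock:
            counter += 1


threads = [threading.Thread(target=add, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)
```

Each thread has its own stack (its own loop variable), while `counter` lives in the address space shared by the whole process, which is why access to it needs synchronization.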
    57. 57. Parallel computing <ul><li>Form of computation in which many calculations carried out simultaneously </li></ul><ul><ul><li>Divide large problem into smaller ones </li></ul></ul><ul><ul><li>data, instruction level and task parallelism </li></ul></ul><ul><ul><li>SISD, SIMD, MISD, MIMD </li></ul></ul><ul><li>Dominant paradigm in the form of multicore processors </li></ul><ul><li>Parallel computer – shared or distributed memory </li></ul><ul><li>Parallel program difficult to write due to </li></ul><ul><ul><li>Software bugs, race conditions </li></ul></ul><ul><ul><li>Communication and synchronization </li></ul></ul><ul><li>Multiple processing elements working concurrently </li></ul><ul><ul><li>Single computer with multiple processors, networked computers, special hardware, etc. </li></ul></ul>
    58. 58. <ul><li>Multithreading </li></ul><ul><ul><li>Model to allow multiple threads within single process </li></ul></ul><ul><ul><li>Can execute in parallel on multiprocessor system </li></ul></ul><ul><li>Process, kernel thread, user thread, fiber (cooperatively scheduled, can run in any thread in the same process) </li></ul><ul><li>Subtasks in a parallel program are called threads </li></ul><ul><ul><li>Lightweight version of threads – fibers </li></ul></ul><ul><ul><li>Bigger versions – processes </li></ul></ul>
    59. 59. <ul><li>Parallel computing – model of computation </li></ul><ul><ul><li>Can utilize processes, multithreading to implement </li></ul></ul>