Introduction to Databases, lecture 12: disk storage management, indexing, and more.

Published in: Technology


  1. Disk Storage Management (Fall 2001, Database Systems)

     Indexing
     • Indexing is a combination of methods for speeding up access to the data in a database
     • The speed is determined by two factors
       – where the data is stored on disk
       – available access paths
     • The primary access method to any table is a table scan: reading the table by finding all tuples one by one
     • Indexing creates multiple access paths to the data, each of which is called a secondary access method
       – indexing speeds up access to a table for a specific set of attributes

     Indexing
     • Example: items(itemid, description, price, city)
       – to find all items in “Dallas”, read all tuples from secondary storage and check if city=“Dallas” for each one (scan)
         • may read many extra tuples that are not part of the answer
       – suppose instead we create an index itemcity for the city attribute of the items relation:
         itemcity: Dallas:{t1,t5,t10}, Boston:{t2,t3,t15}, …
       – to find all items for “Dallas”, we can now find the index entry for “Dallas” and get the ids of just the tuples we want
         • many fewer tuples are read from secondary storage
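The difference between a scan and an index lookup can be sketched in a few lines of Python. This is only an in-memory illustration: the sample rows, the `build_index` helper, and the use of list positions as tuple ids are assumptions made for the example, not part of the lecture.

```python
from collections import defaultdict

# Hypothetical sample rows: (itemid, description, price, city)
items = [
    (1, "lamp", 20.0, "Dallas"),
    (2, "desk", 150.0, "Boston"),
    (3, "chair", 45.0, "Boston"),
    (4, "rug", 80.0, "Austin"),
    (5, "sofa", 400.0, "Dallas"),
]

def scan(city):
    """Table scan: examine every tuple and keep the matching ones."""
    return [row for row in items if row[3] == city]

def build_index(rows, column):
    """Map each value of the column to the positions of the tuples holding it."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[column]].append(position)
    return index

itemcity = build_index(items, 3)

def lookup(city):
    """Index lookup: touch only the tuples listed under the index entry."""
    return [items[pos] for pos in itemcity.get(city, [])]
```

Both `scan("Dallas")` and `lookup("Dallas")` return the same rows, but the lookup reads only the two tuples named in the index entry.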
  2. Disk terminology
     [Figure: disk drive with labels: rotation; platters (2 platters = 4 read/write surfaces); read/write heads, one for each surface; disk arm; controller; track]
     • Can read four tracks from four surfaces at the same time (all tracks at the same radius are called a cylinder)

     Disk terminology
     • Reading data from a disk involves:
       – seek time: move heads to the correct track/cylinder
       – rotational latency: wait until the required data spins under the read/write heads (average is about half the rotation time)
       – transfer time: transfer the required data (read/write data to/from a buffer)
     • Tracks contain the same amount of information even though they have different circumferences
     • A block or a page is the smallest unit of data transfer for a disk (read a block / write a block)
     [Figure: a track divided into blocks/pages]
  3. Disk Space
     • Disk is much cheaper than memory (1GB: $15 - $20)
     • A fast disk today:
       – 10 - 40 GB per disk
       – 4.9 ms average read seek time
       – 2.99 ms average latency
       – 10,000 rpm
       – 280 - 452 Mbits/sec transfer rate
       – 7.04 bits per square inch density
       – 3 - 12 heads, 2 - 6 disk platters
     • Disk is non-volatile storage that survives power failures

     Reading from disk
     • Reading data from disk is extremely slow (compared to reading from memory)
     • To read a page from disk
       – find the page on disk (seek + latency times)
       – transfer the data to memory/buffer (total # bits / transfer rate)
     • Assume the average page size is 4KB. To retrieve a single row/tuple, we need to load the page that contains it
       – assume 300 Mbits/sec transfer rate; to read a page: 4KB = 32,768 bits ≈ 0.03 Mbits, hence we take about 1/10,000 of a second
       – 4.9 ms (seek) + 2.99 ms (latency) + 0.1 ms (transfer time) = 7.99 ms (seek and latency times dominate!)
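The page-read arithmetic above can be checked with a short calculation. The function name and parameter names are made up for this sketch, and 1 Mbit is taken as 10^6 bits.

```python
def page_read_ms(page_bytes=4096, seek_ms=4.9, latency_ms=2.99,
                 rate_mbits_per_s=300):
    """Return (total time, transfer time) in milliseconds for one page read."""
    bits = page_bytes * 8
    # transfer time = amount of data / transfer rate
    transfer_ms = bits / (rate_mbits_per_s * 1_000_000) * 1000
    return seek_ms + latency_ms + transfer_ms, transfer_ms

total_ms, transfer_ms = page_read_ms()
print(f"transfer: {transfer_ms:.3f} ms, total: {total_ms:.2f} ms")
```

The transfer time comes out at roughly a tenth of a millisecond, so almost all of the roughly 8 ms is spent on seek and rotational latency.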
  4. Reading from disk
     • Assume the database reserves a number of memory slots (each holding exactly one page), which are called buffers
     • To read / modify / write tuple t
       – DISK: (read it from disk, write it to buffer)
       – DB: (read it from buffer, modify)
       – DISK: (write it to disk, free the buffer space)
     [Figure: buffer slots; this buffer can hold 4 pages at any time]

     Tablespaces
     • Age-old wisdom: if you store a set of pages in contiguous blocks on disk, then the total read time improves greatly (seek and latency times are reduced)
     • A tablespace is an allocation of space in secondary storage
       – when creating a tablespace, a DBMS requests contiguous blocks of disk space from the OS
       – the tablespace appears as a single file to the OS
       – the management of physical addresses in a tablespace is performed by the DBMS
       – a DBMS can have many tablespaces
       – when a table is created, it is placed in a tablespace, or partitioned between multiple tablespaces
  5. Tablespaces

       CREATE TABLESPACE tspace1
         DATAFILE 'diska:file1.dat' SIZE 20M,
         DATAFILE 'diska:file2.dat' SIZE 40M REUSE;

       CREATE TABLE temp1 (…
         TABLESPACE file1
         STORAGE (initial 6144, next 6144,
                  minextents 1, maxextents 5)) ;

       CREATE TABLE temp2 (…
         TABLESPACE file2
         STORAGE (initial 12144, next 6144,
                  minextents 1, maxextents 5)) ;

     [Figure: tablespace tspace1 contains file1 and file2; temp1 is stored in file1 and temp2 in file2 (actual data)]

     Tablespaces
     • Create table: assign the tuples in the table to a file in a tablespace
       – when a table is created, a chunk of space is assigned to this table; the size of this chunk is given by the “INITIAL EXTENT”
       – when the initial extent becomes full, a new chunk is allocated; the size of all subsequent chunks is given by the “NEXT EXTENT”
       – can also specify
         • maxextents, minextents
         • pctincrease (increase the size of extents at each step)
         • pctfree (how much of the extent must be left free)
  6. Data Storage on pages
     • Layout of a single disk page (assume fixed size rows)
     [Figure: page layout: header info | row directory (1, 2, ..., N) | free space | data rows (Row N, Row N-1, ..., Row 1)]
     • To find a specific row in a page, must know
       – page number (or block number) BBBBBBBB
       – offset (slot number of record within the page) SSSS
       – file name (which datafile/tablespace) FFFF
       – ROWID is then a unique number BBBBBBBB.SSSS.FFFF for a row
         • B, S, F are hexadecimal numbers

     Pseudocolumns
     • Since each tuple has a unique rowid, we can refer to the tuples with their rowid field
     • However, rowid may change if the tuple is stored at a different location (the value of its primary key is a better identifier)
  7. Indexing Concepts
     • Indexing speeds up access to data residing on disk
       – disk access is much slower than main memory access, by orders of magnitude
       – goal: minimize the number of disk accesses
     • Primary access methods rely on the physical location of data as stored in a relation
       – finding “all” tuples with value “x” requires reading the entire relation
     • Secondary access methods use a directory to enable tuples to be found more quickly based on the value of one or more attributes (keys) in a tuple

     Secondary Index
     • To create a simple index on column A of table T, make a list of pairs of the form (attribute A value, tuple rowid) for each tuple in T
       – example: secondary index for the SSN attribute

         SSN            ROWIDs (RID)
         111-11-1111    AAAAqYAABAAAEPvAAH
         222-22-2222    AAAAqYAABAAAEPvAAD
         333-33-3333    AAAAqYAABAAAEPvAAG
         . . .          . . .

     • This index is large and stored on the disk
  8. Secondary Index
     • Suppose a disk page can contain 200 index entries from a secondary index
     • To store a secondary index for a relation with 1 million tuples, assuming no duplicate values, requires 1,000,000 / 200 = 5,000 disk pages
     • To find a particular Person tuple in the SSN index given his or her SSN, you must on average scan half of the index (5,000 / 2 = 2,500 disk accesses)
     • If 20 tuples of the Person relation fit on a page, then a sequential scan of the relation itself needs to read on average half the relation (50,000 / 2 = 25,000 disk accesses)
     • In this case, the secondary index helps a lot

     Efficiency
     • Need to organize the index information in a way that makes it efficient to access and search
       – scanning the index from the beginning is not good enough
     • Sorting the secondary index helps, but is not sufficient
     • Solution 1: build a tree index
     • Solution 2: hash the index
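The page counts above, and the gain from sorting the index, can be reproduced with a small calculation. The binary-search figure is an added illustration (it assumes the sorted index pages can be addressed directly by position); the other numbers are the ones from the slide.

```python
import math

def pages_needed(n_entries, entries_per_page):
    """Number of disk pages required to hold n_entries."""
    return math.ceil(n_entries / entries_per_page)

index_pages = pages_needed(1_000_000, 200)   # secondary index on SSN
table_pages = pages_needed(1_000_000, 20)    # the Person relation itself

avg_index_scan = index_pages / 2             # unsorted index, scanned from the start
avg_table_scan = table_pages / 2             # sequential scan of the relation

# Assumption: with a sorted index, binary search over the pages is possible.
sorted_index_search = math.ceil(math.log2(index_pages))

print(index_pages, table_pages, avg_index_scan, avg_table_scan, sorted_index_search)
```

Even the sorted index still needs over a dozen page accesses per lookup, which is why the lecture moves on to tree and hash organizations.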
  9. Tree Indices
     • Want to minimize the number of disk accesses
       – each tree node requires a disk access
       – therefore, trees that are broad and shallow are preferred over trees that are narrow and deep
     • Balanced binary search trees, AVL trees, etc. that are useful in main memory are too narrow and deep for secondary storage
     • Need an m-way tree where m is large
       – also need a tree that is balanced

     B+-Tree
     • A B+-Tree of order d is a tree in which:
       – each node has between d and 2d key values
       – the key values within a node are ordered
       – each key in a node has a pointer immediately before and after it
         • leaf nodes: the pointer following a key is a pointer to the record with that key
         • interior nodes: pointers point to other nodes in the tree
       – the length of the path from root to leaf is the same for every leaf (balanced)
       – the root may have fewer keys and pointers
  10. Example B+-Tree
     B+-Tree of order 2: each node can hold up to four keys

       Root:            [53]
       Interior nodes:  [11 30]  [66 78]
       Leaf nodes:      [2 7] [11 15 22] [30 41] [53 54 63] [66 69 71 76] [78 84 93]

     Searching in B+-Trees

       Search(T, K) /* searching for tuple with key value K in tree T */
       {
         if T is a non-leaf node
           search for leftmost key K’ in node T such that K < K’
           if such a K’ exists
             ptr = pointer in T immediately before K’
             return the result of Search(ptr, K)
           if no such K’ exists
             ptr = rightmost pointer in node T
             return the result of Search(ptr, K)
         else if T is a leaf node
           search for K in T
           if found, return the pointer following K
           else return NULL /* K not in tree T */
       }
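The search procedure can be sketched in Python as follows. The `Node` class, the `leaf` helper, and the `rid…` strings standing in for rowids are assumptions for this illustration; the tree is the order-2 example, and the loop is an iterative version of the recursive Search above.

```python
from bisect import bisect_right
from dataclasses import dataclass, field

@dataclass
class Node:
    keys: list
    children: list = field(default_factory=list)  # interior nodes only
    rowids: list = field(default_factory=list)    # leaf nodes only

    @property
    def is_leaf(self):
        return not self.children

def leaf(*keys):
    """Build a leaf whose rowids are placeholder strings."""
    return Node(keys=list(keys), rowids=[f"rid{k}" for k in keys])

def search(node, key):
    """Return the rowid stored with key, or None if the key is absent."""
    while not node.is_leaf:
        # bisect_right gives the index of the leftmost key K' with key < K';
        # the child at that index is the pointer immediately before K'.
        # If no such K' exists, the index is len(keys): the rightmost pointer.
        node = node.children[bisect_right(node.keys, key)]
    if key in node.keys:
        return node.rowids[node.keys.index(key)]
    return None

root = Node(keys=[53], children=[
    Node(keys=[11, 30],
         children=[leaf(2, 7), leaf(11, 15, 22), leaf(30, 41)]),
    Node(keys=[66, 78],
         children=[leaf(53, 54, 63), leaf(66, 69, 71, 76), leaf(78, 84, 93)]),
])
```

Every lookup visits exactly three nodes here (root, one interior node, one leaf), which corresponds to three disk accesses if each node is one page.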
  11. Insert Algorithm
     • To insert a new tuple with key K and address rowid:
       – use a modified Search algorithm to look for the leaf node into which key K should be inserted
       – insert key K followed by address rowid into the proper place in this leaf node to maintain order, and rebalance the tree if necessary
     • Rebalancing the tree
       – if the leaf node has room for K and rowid, then no rebalancing is needed
       – if the leaf node has no room for K and rowid, then it is necessary to create a new node and rebalance the tree

     Rebalancing Algorithm
     • Assume that K and rowid are to be inserted into leaf node L, but L has no more room.
       – create a new empty node
       – put K and rowid in their proper place among the entries in L to maintain the key sequence order; there are 2d+1 keys in this sequence
       – leave the first d keys with their rowids in node L and move the final d+1 keys with their rowids to the new node
       – copy the middle key K’ from the original sequence into the parent node, followed by a pointer to the new node
         • put them immediately after the pointer to node L in the parent node
       – apply this algorithm recursively up the tree as needed
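The leaf split at the heart of the rebalancing algorithm can be sketched as a small function. The name `split_leaf` and the representation of a leaf as a sorted list of `(key, rowid)` pairs are assumptions for this example; propagating the separator into the parent is left out.

```python
def split_leaf(entries, key, rowid, d):
    """Split a full leaf of a B+-tree of order d while inserting (key, rowid).

    entries is the sorted list of 2d (key, rowid) pairs already in the leaf.
    Returns (left, right, separator): the first d entries stay in the old
    leaf, the final d+1 entries move to the new leaf, and the separator
    (the middle key, which is the first key of the new leaf) is copied
    into the parent.
    """
    assert len(entries) == 2 * d, "only a full leaf needs splitting"
    sequence = sorted(entries + [(key, rowid)])   # 2d+1 entries in key order
    left, right = sequence[:d], sequence[d:]
    return left, right, right[0][0]

# Inserting key 65 into the full leaf [53 54 57 63] of an order-2 tree
full_leaf = [(53, "r53"), (54, "r54"), (57, "r57"), (63, "r63")]
left, right, separator = split_leaf(full_leaf, 65, "r65", d=2)
```

Here the old leaf keeps keys 53 and 54, the new leaf receives 57, 63 and 65, and 57 is the key copied up to the parent.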
  12. Insert Example
     [Figure: insert record with key 57 into a B+-Tree of order 2. The target leaf [53 54 63] has room, so it becomes [53 54 57 63] and no rebalancing is needed.]

     Another Insert Example
     [Figure: insert record with key 65 into the same B+-Tree of order 2. The target leaf [53 54 57 63] is full, so the sequence 53 54 57 63 65 is split into leaves [53 54] and [57 63 65]; key 57 is copied into the parent node, which changes from [66 78] to [57 66 78].]
  13. Insert Algorithm (1)

       Insert (T, K, rowid, child)
       /* insert new tuple with key K and address rowid into tree T */
       /* child is NULL initially */
       {
         /* handle an interior node of the B+-Tree */
         if T is a non-leaf node
           find j such that Kj ≤ K < Kj+1 for keys K1, …, Kn in T
           ptr = pointer between Kj and Kj+1
           Insert (ptr, K, rowid, child)
           if child is NULL then return
           /* must insert key and child pointer into T */
           if T has space for another key and pointer
             put child.key and child.ptr into T at proper place
             child = NULL
             return

     Insert Algorithm (2)

           else /* must split node T */
             construct sequence of keys and pointers from T with
               child.key and child.ptr inserted at proper place
             first d keys and d+1 pointers from sequence stay in T
             last d keys and d+1 pointers from sequence move to a new node N
             child.key = middle key from sequence
             child.ptr = pointer to N
             if T is root
               create new node containing pointer to T, child.key, and child.ptr
               make this node the new root node of the B+-Tree
             return
         /* handle leaf node of the B+-Tree */
         if T is a leaf node
           if T has space for another key and rowid
             put K and rowid into T at proper place
             return
  14. Insert Algorithm (3)

           else /* must split leaf node T */
             construct sequence of keys and pointers from T with
               K and rowid inserted at proper place
             first d keys and d+1 pointers from sequence stay in T
             last d+1 keys and d+2 pointers from sequence move to a new node N
             child.key = first key in new node N
             child.ptr = pointer to N
             if T was root
               create new node containing pointer to T, child.key, and child.ptr
               make this node the new root node of the B+-Tree
             return
       }

     Deletion
     • Assume that a tuple with key K and address rowid is to be deleted from leaf node L. There is a problem if, after removing K and rowid from L, it has fewer than d keys remaining. To fix this:
       – if a neighbor node has at least d+1 keys, then evenly redistribute the keys and rowids with the neighbor node and adjust the separator key in the parent node
       – otherwise, combine node L with a neighbor node and discard the empty node
         • the parent node now needs one less key and node pointer, so recursively apply this algorithm up the tree until all nodes have enough keys and pointers
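The two repair options for an underfull leaf (redistribute or combine) can be sketched as one helper. The name `fix_underflow` and the use of plain sorted key lists (rowids omitted) are simplifications for this example; the key values are the ones used in the lecture's deletion examples.

```python
def fix_underflow(left, right, d):
    """Repair two adjacent leaves after a deletion left one with < d keys.

    left and right are the sorted key lists of the two neighbors.
    Returns (leaves, separator): either two rebalanced leaves plus the new
    separator key for the parent, or one merged leaf and None (the parent
    then loses a key and a pointer and may itself need fixing).
    """
    merged = left + right
    if len(merged) >= 2 * d:
        # The neighbor had at least d+1 keys: redistribute evenly.
        mid = len(merged) // 2
        new_left, new_right = merged[:mid], merged[mid:]
        return [new_left, new_right], new_right[0]
    # Otherwise combine the two nodes into one.
    return [merged], None

# Deleting 30 from leaf [30 41]: redistribute with the left neighbor.
redistributed = fix_underflow([11, 15, 22], [41], d=2)
# Deleting 7 from leaf [2 7]: the neighbor [11 15] has no spare key, so merge.
combined = fix_underflow([2], [11, 15], d=2)
```

The first call yields leaves [11 15] and [22 41] with separator 22; the second yields the single leaf [2 11 15].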
  15. Deletion Example
     [Figure: delete key 30 from the B+-Tree of order 2. Leaf [30 41] is left with the single key 41, so keys are redistributed between the second and third leaf nodes: [11 15 22] and [41] become [11 15] and [22 41], and the separator key in the parent changes from 30 to 22.]

     Another Deletion Example
     [Figure: delete 7 from the B+-Tree of order 2. Leaf [2 7] is left with the single key 2 and its neighbor [11 15] has no spare key. Cannot redistribute, so combine the left two leaf nodes into [2 11 15].]
  16. Another Example Continued
     [Figure: delete 7 from the B+-Tree of order 2, continued. After the two leaves are combined into [2 11 15], their parent is left with the single key 30: node not valid, too few pointers. Cannot redistribute, so combine with sibling node: the parent, the separator key from the root, and the sibling [66 78] are merged into one node [30 53 66 78].]

     Deletion Algorithm (1)

       Delete (Parent, T, K, oldchild)
       /* delete key K from tree T */
       /* Parent is parent node for T, initially NULL */
       /* oldchild is discarded child node, initially NULL */
       {
         /* handle an interior node of the B+-Tree */
         if T is a non-leaf node
           find j such that Kj ≤ K < Kj+1 for keys K1, …, Kn in T
           ptr = pointer between Kj and Kj+1
           Delete (T, ptr, K, oldchild)
           if oldchild is NULL then return
           /* must handle discarded child node of T */
           remove oldchild and adjacent key from T
           if T still has enough keys and pointers
             oldchild = NULL
             return
  17. Deletion Algorithm (2)

           /* must fix node T */
           get a sibling node S of T using Parent
           if S has extra keys /* redistribute S and T */
             redistribute keys and adjacent pointers evenly between S and T
             K’ = middle unused key from the redistribution
             replace the key in Parent between the pointers to S and T with K’
             oldchild = NULL
             return
           else /* merge S and T */
             R = S or T, whichever is to the right of the other
             oldchild = R
             copy key from Parent node that is immediately before R
               to the end of the node on the left
             move all keys and adjacent pointers from R to the node on the left
             discard node R
             return

     Deletion Algorithm (3)

         /* handle leaf node of the B+-Tree */
         if T is a leaf node
           if T has extra keys
             remove key K from T
             oldchild = NULL
             return
           /* must fix node T */
           get a sibling node S of T using Parent
           if S has extra keys /* redistribute S and T */
             redistribute keys and adjacent pointers evenly between S and T
             K’ = first key from node S or T, whichever is to the right of the other
             replace the key in Parent between the pointers to S and T with K’
             oldchild = NULL
             return
  18. Deletion Algorithm (4)

           else /* merge S and T */
             R = S or T, whichever is to the right of the other
             oldchild = R
             move all keys and adjacent rowids from R to the node on the left
             discard node R
             return
       }

     Analysis of B+-Trees
     • Every access to a node is an access to disk and hence is expensive.
     • Analysis of Find:
       – if there are n tuples in the tree, the height of the tree, h, is bounded by h ≤ ceil(log_d(n))
       – example: d = 50, tree contains 1 million records, then h ≤ 4
     • Analysis of Insert and Delete
       – finding the relevant node requires h accesses
       – rebalancing requires O(h) accesses
       – therefore, the total is O(log_d n) accesses
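The height bound can be evaluated without floating-point logarithms by finding the smallest h with d^h ≥ n, which equals ceil(log_d(n)). The function name `height_bound` is invented for this sketch.

```python
def height_bound(n, d):
    """Smallest h such that d**h >= n, i.e. ceil(log_d(n)) for n >= 1."""
    h, capacity = 0, 1
    while capacity < n:
        capacity *= d
        h += 1
    return h

print(height_bound(1_000_000, 50))   # the slide's example: d = 50, 1 million records
```

For d = 50 and one million records this gives 4, since 50^3 = 125,000 is too small and 50^4 = 6,250,000 is enough.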
  19. B+-tree
     • The create index command creates a B+-tree index

       CREATE INDEX age_idx ON people(age)
         TABLESPACE file1
         PCTFREE 70

     • PCTFREE determines how full each node is allowed to become
     • Optimal operation is usually with nodes about 70% full
     • To reduce disk accesses for sequential processing, pointers are added to the leaf nodes that point to the previous and next leaf nodes

     A B+-Tree Example
     • Givens:
       – disk page has a capacity of 4K bytes
       – each rowid takes 6 bytes and each key value takes 2 bytes
       – each node is 70% full
       – need to store 1 million tuples
     • Leaf node capacity
       – each (key value, rowid) pair takes 8 bytes
       – disk page capacity is 4K, so (4*1024)/8 = 512 (key value, rowid) pairs per leaf page
         • in reality there are extra headers and pointers that we will ignore
     • Hence, the degree for the tree is about 256
  20. Example Continued
     • If all pages are 70% full, each page has about 512*0.7 = 359 entries
     • To store 1 million tuples requires
       – 1,000,000 / 359 = 2786 pages at the leaf level
       – 2786 / 359 = 8 pages at the next level up
       – 1 root page pointing to those 8 pages
     • Hence, we have a B+-tree with 3 levels, and a total of 2786+8+1 = 2795 disk pages

     Duplicate Key Values
     • Duplicate key values in a B+-tree can be handled.
       – (key, rowid) pairs for the same key value can span multiple index nodes
     • Search algorithm needs to be changed
       – find the leftmost entry at the leaf level for the searched item, then scan the index from left to right following leaf level pointers
     • The insertion and deletion algorithms also require small changes
       – they are more costly and hence not always implemented in practice
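The page counts in this example can be reproduced level by level. The helper names are invented for this sketch, and, as in the slides, headers and sibling pointers are ignored and fractional entries are rounded up.

```python
import math

def entries_per_page(page_bytes=4096, entry_bytes=8, fill=0.7):
    """Entries held by a page that is `fill` full (rounded up, as on the slide)."""
    return math.ceil(page_bytes // entry_bytes * fill)

def level_sizes(n_tuples, per_page):
    """Pages per level, from the leaf level up to the single root page."""
    sizes = [math.ceil(n_tuples / per_page)]
    while sizes[-1] > 1:
        sizes.append(math.ceil(sizes[-1] / per_page))
    return sizes

per_page = entries_per_page()
levels = level_sizes(1_000_000, per_page)
print(per_page, levels, sum(levels))
```

The result is 359 entries per page and levels of 2786, 8 and 1 pages, for 2795 pages in total.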
  21. Bitmap Index
     • For some attribute x with possible values A, B and C:
       – create a list of all tuples in the relation and store their rowids at some known location
       – build an index for each value, for example for value A
         • the bitmap contains a “1” at location k if tuple k has value “A” for this attribute
         • otherwise it contains a “0”
       – indices with a lot of “0”s are called sparse and can be compressed

     Bitmap Example

       Tuples (attribute A is the second column):
         s1   10   6   s7    . . .
         s2   15   3   s9    . . .
         s3   15   4   s10   . . .
         s4   10   3   s6    . . .
         s5   15   2   s8    . . .

       Tuple list:          s1  s2  s3  s4  s5  . . .
       Bitmap for A=10:     1   0   0   1   0   . . .
       Bitmap for A=15:     0   1   1   0   1   . . .
  22. Querying with Bitmap Index
     • Suppose we have bitmap indices on attributes x and y
       – To find tuples with x=“A” or x=“B”, take the bitmaps for both values and do a logical or
       – To find tuples with x=“A” and y<>“B”, compute the logical inverse of the bitmap for y=“B” and then do a logical and with the bitmap for x=“A”
     • Bitmaps depend on the actual row ids of tuples
     • If a tuple is deleted, its location can be removed or filled by another tuple (costly if the index is compressed)
     • Too many updates or attributes with too many values lead to bitmaps that are not cost effective

     Primary access methods
     • Heap: tuples are placed in the order they are inserted
     • Cluster: tuples with the same values for attributes A1,…,Ak are placed close to each other on disk
     • Hash: tuples with the same hash value are placed close to each other on disk

     Secondary access methods
     • The primary access method can be anything. Additional indexes are created with entries that point to actual tuples
     [Figure: a B+-tree index on attributes A1,…,Ak pointing into data pages, each with a row directory: Tuple 1, Tuple 2, … , Tuple 10 and Tuple 11, Tuple 12, … , Tuple 20]
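The bitmap queries described under "Querying with Bitmap Index" can be sketched with plain Python lists of bits. The rows reuse the values from the bitmap example; treating the third column as a second indexed attribute `y`, and the helper names, are assumptions for this illustration.

```python
# Rows in tuple-list order: (id, x, y); x is attribute A of the bitmap example.
rows = [("s1", 10, 6), ("s2", 15, 3), ("s3", 15, 4), ("s4", 10, 3), ("s5", 15, 2)]

def bitmap(column, value):
    """Bit k is 1 exactly when tuple k has the given value in the column."""
    return [1 if row[column] == value else 0 for row in rows]

def bit_or(a, b):
    return [p | q for p, q in zip(a, b)]

def bit_and(a, b):
    return [p & q for p, q in zip(a, b)]

def bit_not(a):
    return [1 - p for p in a]

def matching_ids(bits):
    """Translate a result bitmap back into tuple ids."""
    return [row[0] for row, bit in zip(rows, bits) if bit]

x10, x15, y3 = bitmap(1, 10), bitmap(1, 15), bitmap(2, 3)

either = bit_or(x10, x15)                  # x = 10 or x = 15
x10_not_y3 = bit_and(x10, bit_not(y3))     # x = 10 and y <> 3
```

A real system would pack the bits into machine words and often compress them; lists are used here only to keep the logic visible.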
  23. Clusters
     • A cluster is a primary access method; it changes the placement of tuples on disk

       CREATE CLUSTER personnel
         (department_number integer)
         SIZE 512
         STORAGE (INITIAL 100K NEXT 50K)

     • In ORACLE, a cluster can be generated for many tables containing the same set of attributes
     • All tuples in different tables from the same cluster will be placed close to each other on disk (i.e. on the same page and on consecutive pages)

     Adding tables to a cluster
  24. Clusters
     • Each table may belong to at most one cluster.
     • Suppose we retrieve an employee tuple with deptno=10. We find a page with this employee and read it into memory.
     • If there are 20 employees in department 10, then chances are that all these employees are on the same page.
     • To find all employees in departments 10 through 20, we can simply read the necessary pages.
     • A cluster is not an index, but we can also create a B+-tree index on a cluster:

       CREATE INDEX idx_personnel ON CLUSTER personnel;

     Hashing
     • Hashing is another index method that changes the way tuples are placed on disk
     • A hash index on attribute A allocates an initial set of pages (1, 2, 3, ..., n) to store the relation
     • A new tuple T with key A is placed on page h(T.A), where the hash function h ranges between 1 and n
     • If multiple tuples map to the same location/page, this is called a collision. These tuples are placed in an overflow page.
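A static hash file with overflow pages can be sketched as below. The class name `HashFile`, the fixed page capacity, and the single overflow area per page (counted as one extra access) are simplifying assumptions; pages are numbered 0 to n-1 here rather than 1 to n.

```python
class HashFile:
    """Static hash file: n primary pages, one overflow area per page."""

    def __init__(self, n_pages, page_capacity):
        self.page_capacity = page_capacity
        self.pages = [dict() for _ in range(n_pages)]
        self.overflow = [dict() for _ in range(n_pages)]

    def _h(self, key):
        return hash(key) % len(self.pages)      # page number 0 .. n-1

    def insert(self, key, rowid):
        slot = self._h(key)
        page = self.pages[slot]
        if key in page or len(page) < self.page_capacity:
            page[key] = rowid
        else:
            self.overflow[slot][key] = rowid    # collision on a full page

    def find(self, key):
        """Return (rowid, page accesses); rowid is None if the key is absent."""
        slot = self._h(key)
        if key in self.pages[slot]:
            return self.pages[slot][key], 1
        if self.overflow[slot]:
            return self.overflow[slot].get(key), 2
        return None, 1

f = HashFile(n_pages=4, page_capacity=2)
for k in range(12):
    f.insert(k, f"rid{k}")
average_accesses = sum(f.find(k)[1] for k in range(12)) / 12
```

With 12 keys squeezed into 4 pages of capacity 2, a third of the keys land in overflow, so the average lookup costs more than one access.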
  25. Hashing
     • The number of key values is given by HASHKEYS
     • Hashing is useful for finding a tuple with a given key value
     • Hashing is not as useful for ranges of key values or for sequential processing of tuples in key order
     • In the best case, a tuple is found with one disk access
     • In the average case, expect 1.2 disk accesses or more (because of overflow pages)

     Extensible Hashing
     • Assume that we originally allocate 2^n pages for the hash
     • Distribute tuples according to hash function mod 2
       – hash the key to produce a bit string and then use the least significant bit
     • If a disk page becomes full, double the directory size instead of creating overflow buckets
     [Figure: hash directory with entries 0 and 1 pointing to Page 0 and Page 1, which hold the tuples]
  26. Extensible Hashing
     • Insert into a full Page 1: double the directory size
     • The full page is split into two. Its contents are rehashed between the original page and the new page, using one additional bit from the string produced by the hash function
     • Just created a new directory entry for Page 3. Since Page 1 is not full, this directory entry points to Page 1. The contents of Page 1 will be rehashed with Page 3 when Page 1 becomes full, rather than creating a new page.
     [Figure: hash directory with entries 00, 01, 10, 11 and pages Page 0, Page 1, Page 2 (new page), Page 3]

     Extensible Hashing
     • As the hash directory size grows, it must be stored on the disk
     • At most two disk accesses are needed to retrieve any tuple
       – this is a better upper bound than for B+-Trees
     • However, extensible hashing is not as good as B+-Trees for range queries and sequential processing where you want to process all the tuples of a relation
     • Consequently, B+-Trees are used more frequently than Extensible Hashing
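Directory doubling and page splitting can be sketched in memory as follows. This is a textbook-style version using the least significant hash bits, not necessarily the exact scheme drawn on the slides; the class names, the per-page `depth` field, and the tiny page capacity are assumptions for the example.

```python
class Page:
    def __init__(self, depth):
        self.depth = depth      # number of hash bits this page distinguishes
        self.items = {}

class ExtensibleHash:
    def __init__(self, page_capacity=2):
        self.capacity = page_capacity
        self.global_depth = 1
        self.directory = [Page(1), Page(1)]     # entries for bit 0 and bit 1

    def _slot(self, key):
        # Use the global_depth least significant bits of the hash value.
        return hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key):
        # One directory access plus one page access.
        return self.directory[self._slot(key)].items.get(key)

    def put(self, key, value):
        page = self.directory[self._slot(key)]
        if key in page.items or len(page.items) < self.capacity:
            page.items[key] = value
            return
        if page.depth == self.global_depth:
            # Double the directory; each new entry shares its twin's page.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split the full page using one additional hash bit.
        page.depth += 1
        new_page = Page(page.depth)
        high_bit = 1 << (page.depth - 1)
        for i, p in enumerate(self.directory):
            if p is page and i & high_bit:
                self.directory[i] = new_page
        old_items, page.items = page.items, {}
        for k, v in old_items.items():
            self.directory[self._slot(k)].items[k] = v
        self.put(key, value)    # retry; may trigger a further split

table = ExtensibleHash(page_capacity=2)
for k in range(16):
    table.put(k, f"rid{k}")
```

A lookup never follows an overflow chain: it reads the directory entry and then exactly one page, which is the two-access bound mentioned above.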