Stage2Raj.ppt

  1. 1. An Open Source DBMS for Handheld Devices: Stage 2. By Rajkumar Sen, IIT Bombay, under the guidance of Prof. Krithi Ramamritham
  2. 2. Outline <ul><li>Introduction </li></ul><ul><li>Storage Management </li></ul><ul><li>Query Processing </li></ul><ul><li>Future Work </li></ul>
  3. 3. Introduction <ul><li>Stage 1: survey of </li></ul><ul><ul><li>Storage models: Flat Storage, Domain Storage, and Ring Storage </li></ul></ul><ul><ul><li>Query processing issues </li></ul></ul><ul><ul><li>Data synchronization </li></ul></ul><ul><ul><li>Concurrency control and recovery </li></ul></ul><ul><li>Goals for Stage 2 </li></ul><ul><ul><li>New storage models to further reduce storage cost </li></ul></ul><ul><ul><li>Memory-cognizant query processing </li></ul></ul><ul><ul><li>Data synchronization issues </li></ul></ul><ul><ul><li>System implementation issues </li></ul></ul>
  4. 4. Storage Management <ul><li>Aim for a compact representation of data </li></ul><ul><li>Existing storage models </li></ul><ul><ul><li>Flat Storage </li></ul></ul><ul><ul><li>Pointer-based Domain Storage </li></ul></ul><ul><li>In Domain Storage, each tuple stores a pointer of size p (typically 4 bytes) to the domain value </li></ul><ul><ul><li>Can we further reduce the storage cost? </li></ul></ul>
  5. 5. Storage Management <ul><li>ID Storage: </li></ul><ul><ul><li>Assign an identifier to each of the domain values </li></ul></ul><ul><ul><li>The identifier is the ordinal position of the value in the domain table </li></ul></ul><ul><ul><li>Store the identifier instead of the pointer </li></ul></ul><ul><ul><li>Use the identifier as an offset into the domain table </li></ul></ul><ul><ul><li>Extendable IDs: the length of the identifier grows and shrinks with the number of domain values </li></ul></ul>
  6. 6. Storage Management <ul><li>D domain values can be distinguished by identifiers of length $\lceil \log_2 D / 8 \rceil$ bytes. </li></ul><ul><li>Starting with 1-byte identifiers, the length grows and shrinks as domain values are added and removed. </li></ul><ul><li>ID values are projected out from the rest of the relation and stored separately, maintaining Positional Indexing. </li></ul><ul><li>Why not bit identifiers? </li></ul><ul><ul><li>Storage is byte addressable. </li></ul></ul><ul><ul><li>Packing bit identifiers into bytes increases the storage management complexity. </li></ul></ul>
  7. 7. Storage Management [Figure: ID Storage. Relation R holds a column of ID values (0, 1, 2, 1, ..., n); each ID is a positional index into the table of domain values (v0, v1, ..., vn).]
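The ID Storage idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `id_width_bytes` and `IdColumn` are hypothetical, identifiers are packed little-endian, and only growth of the identifier length is shown (shrinking on deletion is omitted).

```python
def id_width_bytes(num_domain_values: int) -> int:
    """Whole bytes needed to distinguish the given number of domain values."""
    if num_domain_values <= 1:
        return 1
    bits = (num_domain_values - 1).bit_length()  # ceil(log2(D)) for D >= 2
    return (bits + 7) // 8


class IdColumn:
    """One attribute stored as fixed-width IDs plus a domain table (hypothetical sketch)."""

    def __init__(self):
        self.domain = []        # domain table: position == identifier
        self.lookup = {}        # value -> identifier, to find existing domain values
        self.ids = bytearray()  # packed identifiers, one per tuple (positional indexing)
        self.width = 1          # current identifier length in bytes

    def append(self, value):
        ident = self.lookup.get(value)
        if ident is None:
            ident = len(self.domain)
            self.domain.append(value)
            self.lookup[value] = ident
            needed = id_width_bytes(len(self.domain))
            if needed != self.width:
                self._repack(needed)  # reorganization at a width boundary
        self.ids += ident.to_bytes(self.width, "little")

    def _repack(self, new_width):
        old = [int.from_bytes(self.ids[i:i + self.width], "little")
               for i in range(0, len(self.ids), self.width)]
        self.ids = bytearray(b"".join(v.to_bytes(new_width, "little") for v in old))
        self.width = new_width

    def get(self, pos):
        """Return the attribute value of the tuple at position pos."""
        start = pos * self.width
        ident = int.from_bytes(self.ids[start:start + self.width], "little")
        return self.domain[ident]  # identifier used as an offset into the domain table


col = IdColumn()
for v in ["red", "green", "red"]:
    col.append(v)
```

With up to 256 distinct values each tuple costs one byte for this attribute; inserting the 257th distinct value triggers the repacking step, which is the reorganization cost the deck is concerned with.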
  8. 8. Storage Management <ul><li>Ping Pong Effect </li></ul><ul><ul><li>At the boundaries, ID values are reorganized whenever the identifier length changes </li></ul></ul><ul><ul><li>Frequent insertions and deletions at the boundaries might result in a lot of reorganization </li></ul></ul><ul><ul><li>This phenomenon should be avoided </li></ul></ul><ul><li>No deletion of domain values </li></ul><ul><ul><li>Domain structure means a future insertion might reference the deleted value </li></ul></ul><ul><ul><li>Do not delete a domain value even if it is no longer referenced </li></ul></ul><ul><li>Setting a threshold for deletion </li></ul><ul><ul><li>Delete only if the number of deletions exceeds a threshold </li></ul></ul><ul><ul><li>Increase the threshold when boundaries are being crossed </li></ul></ul>
  9. 9. Storage Management <ul><li>Primary Key-Foreign Key relationship </li></ul><ul><ul><li>Primary key: a domain in itself </li></ul></ul><ul><ul><li>IDs are assigned to primary key values </li></ul></ul><ul><ul><li>The foreign key values stored in the child table are the corresponding primary key IDs </li></ul></ul><ul><ul><li>The projected foreign key column forms a Join Index </li></ul></ul>[Figure: Primary Key-Foreign Key Join Index. The foreign key column S.B of child table S holds ID values (0, 1, 2, 1, ..., n), each a positional index into the rows (v0, v1, ..., vn) of parent table R.]
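Because each foreign key ID is the position of the parent row, the join between child and parent needs no search. The sketch below illustrates this with hypothetical sample data; `join_via_ids`, the relation contents, and the tuple layout are assumptions made for the example.

```python
# Parent table R: a row's position is its primary key ID.
parent_rows = [("D10", "Sales"), ("D20", "Research"), ("D30", "Support")]

# Child table S with the foreign key column projected out.
# child_rows[i] and fk_ids[i] belong to the same tuple (positional indexing).
child_rows = [("alice",), ("bob",), ("carol",)]
fk_ids = [0, 2, 0]


def join_via_ids(child_rows, fk_ids, parent_rows):
    """Join child and parent by using each foreign key ID as a direct offset."""
    return [child + parent_rows[fk] for child, fk in zip(child_rows, fk_ids)]


joined = join_via_ids(child_rows, fk_ids, parent_rows)
```

Each child tuple costs one list lookup, so the join is linear in the size of the child table regardless of the size of the parent.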
  10. 10. Storage Management <ul><li>ID-based Storage wins over Domain Storage when $p > \lceil \log_2 D / 8 \rceil$, where p is the pointer size in bytes and D is the number of domain values </li></ul><ul><li>Relations on a small device do not have very high cardinality, so this condition holds for most of the data </li></ul><ul><li>Advantages </li></ul><ul><li>(i) Considerable saving in storage cost </li></ul><ul><li>(ii) Efficient join between parent table and child table </li></ul>
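A short worked calculation makes the condition concrete. The function names and the sample figures (10,000 tuples, 200 domain values) are assumptions chosen for illustration; the domain table itself is needed by both models and is therefore left out of the comparison.

```python
def id_width_bytes(num_domain_values: int) -> int:
    """Whole bytes needed for an identifier over the given number of domain values."""
    bits = max(1, (num_domain_values - 1).bit_length())
    return (bits + 7) // 8


def column_bytes(num_tuples: int, num_domain_values: int, pointer_size: int = 4):
    """Return (Domain Storage bytes, ID Storage bytes) for one attribute column."""
    domain_cost = num_tuples * pointer_size
    id_cost = num_tuples * id_width_bytes(num_domain_values)
    return domain_cost, id_cost


domain_cost, id_cost = column_bytes(num_tuples=10_000, num_domain_values=200)
```

Here the pointer column takes 40,000 bytes and the ID column 10,000 bytes. With 4-byte pointers the identifier stays shorter than the pointer for any domain of up to 2^24 values, which is far beyond what a handheld relation is likely to hold.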
  11. 11. Storage Management <ul><li>Bitmap Storage </li></ul><ul><ul><li>Suited to attributes whose number of domain values is very small compared to the number of tuples, e.g., True, False </li></ul></ul><ul><ul><li>Suited to selection on multiple attributes </li></ul></ul><ul><li>A Data + Index Model </li></ul><ul><ul><li>A bitmap index is created for every bitmap attribute </li></ul></ul><ul><ul><li>Attribute values are not stored in the base relation </li></ul></ul><ul><ul><li>The index can be used to retrieve the domain value of each tuple </li></ul></ul><ul><li>Projection becomes expensive, as is the case with Ring Storage </li></ul><ul><li>A join index between parent table and child table is possible by storing a bitmap for every primary key value </li></ul>
  12. 12. Storage Management <ul><li>Bitmap Storage is not an alternative to Ring Storage </li></ul><ul><li>The indexing capabilities of the two models are different </li></ul><ul><li>Depending on attribute characteristics, choose the appropriate model </li></ul><ul><li>Memory requirement for selection </li></ul><ul><ul><li>The number of bit vectors is equal to the number of attributes that form part of the selection </li></ul></ul><ul><ul><li>The bit vectors are held in memory </li></ul></ul>
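The Bitmap Storage model can be sketched as follows, using Python integers as bit vectors. The helper names and the sample attributes are hypothetical; the point is to show that a multi-attribute selection is a bitwise AND over one bit vector per attribute, while recovering the value of a single tuple requires probing the bitmaps.

```python
def build_bitmap_index(column):
    """Map each domain value to a bit vector; bit i is set if tuple i has that value."""
    index = {}
    for pos, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << pos)
    return index


def matching_positions(bitmap):
    """List the tuple positions whose bit is set."""
    positions = []
    pos = 0
    while bitmap:
        if bitmap & 1:
            positions.append(pos)
        bitmap >>= 1
        pos += 1
    return positions


def value_at(index, pos):
    """Projection: the value is not stored in the tuple, so probe each bitmap."""
    for value, bitmap in index.items():
        if (bitmap >> pos) & 1:
            return value
    raise IndexError(pos)


# Two bitmap attributes of a five-tuple relation (sample data).
in_stock = [True, False, True, True, False]
size = ["S", "M", "S", "L", "S"]
stock_index = build_bitmap_index(in_stock)
size_index = build_bitmap_index(size)

# Selection: in_stock = True AND size = "S" needs two bit vectors in memory.
hits = matching_positions(stock_index[True] & size_index["S"])
```

`value_at` shows why projection is costly in this model: in the worst case it touches one bitmap per domain value.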
  13. 13. Query Processing <ul><li>Considerations </li></ul><ul><ul><li>Minimize writes to secondary storage </li></ul></ul><ul><ul><li>Use the limited main memory efficiently </li></ul></ul><ul><ul><li>A read buffer is not required </li></ul></ul><ul><ul><li>Main memory serves as the write buffer </li></ul></ul><ul><ul><li>If the read:write ratio is very high, use flash memory as the write buffer </li></ul></ul><ul><li>Query Plan </li></ul><ul><ul><li>An optimal query plan is needed </li></ul></ul><ul><ul><li>Reduce materialization; if it is absolutely necessary, materialize in main memory </li></ul></ul><ul><ul><li>Bushy trees and right-deep trees are ruled out </li></ul></ul><ul><ul><li>A left-deep tree is best suited for pipelined evaluation </li></ul></ul><ul><ul><li>The right operand in a left-deep tree is always a stored relation </li></ul></ul><ul><ul><li>Only one input is pipelined </li></ul></ul>
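The pipelining argument for left-deep trees can be illustrated with Python generators. This is a sketch under assumed names (`scan`, `nested_loop_join`) and sample relations; it uses the nested loop scheme only because it is the simplest to show.

```python
def scan(stored_relation):
    """Leaf of the plan: stream the tuples of a stored relation one at a time."""
    yield from stored_relation


def nested_loop_join(left_stream, stored_right, match):
    """Pipelined join: the left input is a stream, the right input is a stored relation."""
    for left in left_stream:
        for right in stored_right:
            if match(left, right):
                yield left + right  # passed upward immediately, never materialized


# Sample stored relations.
R = [(1, "a"), (2, "b")]
S = [(1, "x"), (2, "y"), (2, "z")]
T = [("x", 10), ("z", 30)]

# Left-deep plan ((R join S) join T): every right operand is a stored relation.
r_join_s = nested_loop_join(scan(R), S, lambda l, r: l[0] == r[0])
plan = nested_loop_join(r_join_s, T, lambda l, r: l[3] == r[0])
result = list(plan)
```

No intermediate result is written anywhere: each tuple of R join S is consumed by the second join as soon as it is produced. In a bushy or right-deep tree the right operand of some join would itself be an intermediate result, which would have to be materialized before it could be rescanned.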
  14. 14. Query Processing <ul><li>Memory Allocation to Operators </li></ul><ul><ul><li>Main memory is limited, so we cannot assume that the entire memory is available to every operator in the left-deep tree plan </li></ul></ul><ul><ul><li>Can the plan be executed with the available memory? </li></ul></ul><ul><ul><li>If nested loop algorithms are used for every operator, the plan executes with the minimum amount of memory </li></ul></ul><ul><ul><li>However, nested loop algorithms are inefficient </li></ul></ul><ul><ul><li>Should memory usage be reduced to a minimum at the cost of performance? </li></ul></ul><ul><ul><li>Memory is increasing with every new device </li></ul></ul><ul><ul><li>Different devices come with different memory sizes </li></ul></ul><ul><ul><li>Query plans should make extensive use of the memory that is available </li></ul></ul><ul><ul><li>Memory must be optimally allocated among all operators </li></ul></ul>
  15. 15. Query Processing <ul><li>Operator evaluation schemes </li></ul><ul><ul><li>Different schemes for an operator </li></ul></ul><ul><ul><li>All have different memory usage and cost </li></ul></ul><ul><ul><li>Schemes conform to left-deep tree query plan </li></ul></ul><ul><ul><li>Cost of a scheme is the computation time </li></ul></ul><ul><li>Schemes for Join </li></ul><ul><ul><li>Nested Loop Join </li></ul></ul><ul><ul><li>Indexed Nested Loop Join </li></ul></ul><ul><ul><li>Hash Join </li></ul></ul><ul><li>Similar schemes for other operators </li></ul>
  16. 16. Query Processing <ul><li>Benefit/Size of a scheme </li></ul><ul><ul><li>Every scheme is characterized by a benefit/size ratio, which represents its benefit per unit of memory allocated </li></ul></ul><ul><ul><li>The minimum scheme for an operator is the scheme that has maximum cost and minimum memory </li></ul></ul><ul><ul><li>Assume n schemes $s_1, s_2, \ldots, s_n$ to implement an operator $o$ </li></ul></ul><ul><ul><li>$\min(o) = s_{min}$, where $\forall i,\ 1 \le i \le n:\ \mathrm{Cost}(s_i) \le \mathrm{Cost}(s_{min})$ and $\mathrm{Memory}(s_i) \ge \mathrm{Memory}(s_{min})$ </li></ul></ul><ul><ul><li>$s_{min}$ is the minimum scheme for operator $o$. Then, </li></ul></ul><ul><ul><li>$\mathrm{Benefit}(s_i) = \mathrm{Cost}(s_{min}) - \mathrm{Cost}(s_i)$ </li></ul></ul><ul><ul><li>$\mathrm{Size}(s_i) = \mathrm{Memory}(s_i) - \mathrm{Memory}(s_{min})$ </li></ul></ul>
  17. 17. Query Processing <ul><li>An operator is defined by the benefit and size of its schemes </li></ul><ul><li>Every operator is a collection of (size, benefit) points: n points for n schemes </li></ul>[Figure: (Size, Benefit) points for an operator, with Size and Benefit as the axes: (0,0), (s1,b1), (s2,b2).]
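As an illustration of the (size, benefit) points, the sketch below computes them for one join operator. The `Scheme` type, the function name, and the cost and memory figures are hypothetical; only the definitions of benefit and size come from the slides.

```python
from typing import Dict, List, NamedTuple, Tuple


class Scheme(NamedTuple):
    name: str
    cost: float   # computation time of the scheme
    memory: int   # memory the scheme needs


def size_benefit_points(schemes: List[Scheme]) -> Dict[str, Tuple[int, float]]:
    """Return {scheme name: (size, benefit)} relative to the minimum scheme."""
    # Minimum scheme: least memory (and, among ties, the highest cost).
    s_min = min(schemes, key=lambda s: (s.memory, -s.cost))
    return {s.name: (s.memory - s_min.memory, s_min.cost - s.cost) for s in schemes}


join_schemes = [
    Scheme("nested_loop", cost=100, memory=2),
    Scheme("indexed_nested_loop", cost=40, memory=6),
    Scheme("hash", cost=10, memory=20),
]
points = size_benefit_points(join_schemes)
```

The minimum scheme always sits at the origin. With these sample figures the indexed nested loop join gains 60 cost units for 4 extra memory units (ratio 15), while the hash join gains 90 for 18 (ratio 5).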
  18. 18. Query Processing <ul><li>Optimal Memory Allocation </li></ul><ul><ul><li>Determine the amount of memory to allocate to each operator so as to get maximum benefit </li></ul></ul><ul><li>2-Phase Approach </li></ul><ul><ul><li>Phase 1: The query is first optimized to get a query plan </li></ul></ul><ul><ul><li>Phase 2: Memory is divided among the operators </li></ul></ul><ul><ul><li>In the traditional approach, the scheme for every operator is determined in phase 1 and remains unchanged after phase 2; memory is allocated in phase 2 on the basis of the cost functions of the schemes </li></ul></ul><ul><ul><li>Memory is assumed to be available for all the schemes, which may not be true for a resource-constrained device </li></ul></ul>
  19. 19. Query Processing <ul><li>Depending on the available memory, we need to determine the best scheme for every operator out of all possible ones </li></ul><ul><li>The schemes chosen in phase 1 and those in place after phase 2 need not be the same </li></ul><ul><li>Optimal division of memory involves selecting the best scheme for every operator </li></ul>
  20. 20. Query Processing <ul><li>Our Solution </li></ul><ul><ul><li>We use a heuristic to determine which operator gains the most per unit of memory allocated, and allocate memory to that operator </li></ul></ul><ul><ul><li>The gain of every operator is determined by its best possible scheme </li></ul></ul><ul><ul><li>Repeat the process until memory allocation is done </li></ul></ul><ul><ul><li>Heuristic: select the scheme that has the maximum benefit/size and allocate its memory </li></ul></ul>
  21. 21. Query Processing <ul><li>MemAllocate(M_Total) { </li></ul><ul><li>1. M_min = Σ_{i=1..m} Memory(min(i)) </li></ul><ul><li>2. for i = 1 to m do </li></ul><ul><li>3. Scheme(i) = min(i) </li></ul><ul><li>4. end for </li></ul><ul><li>5. M_avail = M_Total - M_min </li></ul><ul><li>6. s_best, o_best = GetBestScheme(M_avail) </li></ul><ul><li>7. if no best scheme then return </li></ul><ul><li>8. else { </li></ul><ul><li>9. M_avail = M_avail - Memory(s_best) + Memory(Scheme(o_best)) </li></ul><ul><li>10. Scheme(o_best) = s_best </li></ul><ul><li>11. RemoveSchemes(s_best, o_best) </li></ul><ul><li>12. RecomputeBenefits(s_best, o_best) </li></ul><ul><li>13. } </li></ul><ul><li>14. goto step 6 </li></ul><ul><li>} </li></ul><ul><li>Complexity = O(nm^2), where m = number of operators and n = number of schemes </li></ul>
  22. 22. Query Processing <ul><li>Recomputation of Benefits </li></ul><ul><ul><ul><li>Once the operator o_best gets memory Memory(s_best), the benefit and size of all the schemes of o_best that need more memory than s_best change. </li></ul></ul></ul><ul><ul><ul><li>The new benefit and size values are the differences between their old values and those of s_best. </li></ul></ul></ul>[Figure: Benefit and Size Recomputation. Points (0,0), (s1,b1), (s2,b2) on Size and Benefit axes. Scheme 1 has the highest benefit/size ratio; once it is chosen, Benefit(Scheme 2) = b2 - b1 and Size(Scheme 2) = s2 - s1.]
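The MemAllocate heuristic, including the benefit recomputation, can be sketched in Python as below. This is one possible reading of the pseudocode, not the authors' code: the `Scheme` type, the function name, and the sample figures are hypothetical, and recomputation is done implicitly by always measuring benefit and size against the scheme currently assigned to an operator.

```python
from typing import Dict, List, NamedTuple


class Scheme(NamedTuple):
    name: str
    cost: float
    memory: int


def mem_allocate(operators: Dict[str, List[Scheme]], total_memory: int) -> Dict[str, Scheme]:
    """Greedily upgrade the operator whose best scheme gains most per unit of memory."""
    # Steps 1-4: start every operator on its minimum scheme.
    chosen = {op: min(s, key=lambda x: (x.memory, -x.cost)) for op, s in operators.items()}
    # Step 5: memory left once the minimum schemes are paid for.
    avail = total_memory - sum(s.memory for s in chosen.values())
    if avail < 0:
        raise ValueError("plan cannot run even with the minimum schemes")
    candidates = {op: list(s) for op, s in operators.items()}

    while True:
        best = None  # (ratio, operator, scheme)
        # Step 6: GetBestScheme, limited to upgrades that fit in the available memory.
        for op, schemes in candidates.items():
            current = chosen[op]
            for s in schemes:
                size = s.memory - current.memory    # recomputed relative to current
                benefit = current.cost - s.cost
                if size <= 0 or benefit <= 0 or size > avail:
                    continue
                if best is None or benefit / size > best[0]:
                    best = (benefit / size, op, s)
        if best is None:  # Step 7: nothing affordable improves the plan.
            return chosen
        _, op, s = best
        avail -= s.memory - chosen[op].memory       # Step 9
        chosen[op] = s                              # Step 10
        # Step 11: drop schemes that need no more memory than the one just chosen.
        candidates[op] = [c for c in candidates[op] if c.memory > s.memory]


operators = {
    "join1": [Scheme("nested_loop", 100, 2), Scheme("indexed_nested_loop", 40, 6),
              Scheme("hash", 10, 20)],
    "join2": [Scheme("nested_loop", 80, 2), Scheme("hash", 20, 10)],
}
plan = mem_allocate(operators, total_memory=20)
```

With 20 memory units, the minimum schemes take 4, join1 is upgraded first (ratio 15), join2 second (ratio 7.5), and the remaining 4 units are not enough to move join1 to a hash join, whose recomputed size is 14.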
  23. 23. Query Processing <ul><li>1-Phase Approach </li></ul><ul><li>The 2-phase solution optimally allocates memory to all the operators in the query plan. </li></ul><ul><li>However, the plan itself might be suboptimal for the given available memory. </li></ul><ul><li>The 1-phase approach takes into account memory division among operators while choosing between plans. </li></ul><ul><li>Ideally, 1-phase optimization should be done, but the optimizer becomes complex. </li></ul>
  24. 24. Future Work <ul><li>Implementation Status </li></ul><ul><ul><li>1. Flat Storage, Domain Storage, Ring Storage, and ID Storage </li></ul></ul><ul><ul><li>2. Join algorithms </li></ul></ul><ul><li>Future Work </li></ul><ul><ul><li>Bitmap Storage implementation </li></ul></ul><ul><ul><li>Algorithms for aggregation </li></ul></ul><ul><ul><li>Query optimizer and the iterator </li></ul></ul><ul><ul><li>Test using sample relations and data from handheld apps </li></ul></ul><ul><ul><li>Examine the feasibility of a 1-phase optimizer </li></ul></ul><ul><ul><li>Database Module Toolkit </li></ul></ul><ul><ul><li>An operator that returns first-k results of a query </li></ul></ul><ul><ul><li>Application specific DBMS </li></ul></ul>
  25. 25. Thank You