  1. 1. An open source DBMS for handheld devices by Rajkumar Sen IIT Bombay Under the guidance of Prof. Krithi Ramamritham
  2. 2. Outline <ul><li>Introduction </li></ul><ul><li>Storage Management </li></ul><ul><li>Query Processing </li></ul><ul><li>Other issues </li></ul><ul><li>Performance Evaluation </li></ul><ul><li>Conclusions </li></ul>
  3. 3. Introduction <ul><li>A resource constrained device </li></ul><ul><ul><li>A small computer with limited resources </li></ul></ul><ul><ul><li>e.g. Cellphones, Simputer, Palm devices etc . </li></ul></ul><ul><li>Data management is important </li></ul><ul><ul><li>Increasing number of applications </li></ul></ul><ul><ul><li>They deal with a fair amount of data </li></ul></ul><ul><ul><li>Complex queries involving joins and aggregates </li></ul></ul><ul><ul><li>Atomicity and Durability for data consistency </li></ul></ul><ul><ul><li>Ease of application development </li></ul></ul><ul><li>A device resident DBMS is needed </li></ul>
  4. 4. Introduction <ul><li>Need for Synchronization </li></ul><ul><ul><li>Data from remote server downloaded on the device </li></ul></ul><ul><ul><li>Updates at both places </li></ul></ul><ul><ul><li>Common data needs to be synchronized </li></ul></ul><ul><li>Challenges </li></ul><ul><ul><li>Limited computing power and main memory </li></ul></ul><ul><ul><li>Limited stable storage </li></ul></ul><ul><ul><li>Resources are not uniform across devices </li></ul></ul><ul><ul><li>Need a system that can do the best for every device </li></ul></ul>
  5. 5. Introduction <ul><li>Storage Management </li></ul><ul><ul><li>Reduce storage cost to a minimum </li></ul></ul><ul><ul><li>Limited storage could preclude any additional index </li></ul></ul><ul><ul><li>Data model should try to incorporate some index information </li></ul></ul><ul><li>Query Processing </li></ul><ul><ul><li>Memory limits the query processing capabilities </li></ul></ul><ul><ul><li>Minimum-memory algorithms in existing systems do not work well for complex joins and aggregates </li></ul></ul><ul><ul><li>Need algorithms that create in-memory indices and save aggregate values </li></ul></ul><ul><ul><li>Optimal memory allocation among operators </li></ul></ul>
  6. 6. Storage Management <ul><li>Aim at compactness in representation of data </li></ul><ul><li>Existing storage models </li></ul><ul><ul><li>Flat Storage </li></ul></ul><ul><ul><ul><li>Tuples are stored sequentially. This ensures access locality but consumes space. </li></ul></ul></ul><ul><ul><li>Pointer-based Domain Storage </li></ul></ul><ul><ul><ul><li>Values are partitioned into domains, which are sets of unique values </li></ul></ul></ul><ul><ul><ul><li>Tuples reference the attribute value by means of pointers </li></ul></ul></ul><ul><ul><ul><li>One domain can be shared among multiple attributes </li></ul></ul></ul><ul><li>In Domain Storage, each attribute value costs a pointer of size p (typically 4 bytes) to the domain value. </li></ul><ul><ul><li>Can we further reduce the storage cost? </li></ul></ul>
  7. 7. Storage Management <ul><li>ID Storage: </li></ul><ul><ul><li>An identifier for each of the domain values </li></ul></ul><ul><ul><li>Identifier is the ordinal value in the domain table </li></ul></ul><ul><ul><li>Store the identifier instead of the pointer </li></ul></ul><ul><ul><li>Use the identifier as an offset into the domain table </li></ul></ul><ul><ul><li>Extendable IDs: the length of the identifier grows and shrinks with the number of domain values </li></ul></ul>
  8. 8. Storage Management <ul><li>D domain values can be distinguished by identifiers of length $\lceil \log_2 D / 8 \rceil$ bytes. </li></ul><ul><li>Starting with 1 byte identifiers, the length grows and shrinks. </li></ul><ul><li>ID values are projected out from the rest of the relation and stored separately, maintaining Positional Indexing. </li></ul><ul><li>Why not bit identifiers? </li></ul><ul><ul><li>Storage is byte addressable. </li></ul></ul><ul><ul><li>Packing bit identifiers in bytes increases the storage management complexity. </li></ul></ul>
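The identifier length rule on this slide can be sketched in a few lines. This is only an illustration, not code from the system (which is written in C); the function name `id_width_bytes` is hypothetical.

```python
def id_width_bytes(num_values: int) -> int:
    """Smallest whole number of bytes that can label num_values distinct domain values."""
    if num_values < 1:
        raise ValueError("a domain needs at least one value")
    # IDs run from 0 to num_values - 1, so the largest ID decides the width.
    # A domain with a single value still gets the 1-byte starting width.
    bits = max(1, (num_values - 1).bit_length())
    return (bits + 7) // 8  # round up to whole bytes


# The width changes only when the domain crosses a byte boundary.
for d in (1, 256, 257, 65536, 65537):
    print(d, id_width_bytes(d))
```

Crossing 256 or 65536 domain values is where the identifier length grows or shrinks by one byte.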
  9. 9. Storage Management [Figure: ID Storage. Relation R stores a column of ID values; each ID is the position (0 … n) of a value v0, v1, …, vn in the Domain Values table (Positional Indexing).]
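A minimal sketch of the layout in the figure, assuming one domain table per attribute. The class name `IdColumn` and the sample city values are hypothetical, and the dictionary is only a build-time helper, not part of the stored representation.

```python
class IdColumn:
    """One attribute stored as a domain table plus a column of ordinal IDs."""

    def __init__(self):
        self.domain = []   # unique values; the list position is the ID
        self._id_of = {}   # value -> ID, used only while loading data
        self.ids = []      # ids[i] belongs to tuple i (positional indexing)

    def append(self, value):
        # A new value is added to the end of the domain and gets the next ordinal.
        if value not in self._id_of:
            self._id_of[value] = len(self.domain)
            self.domain.append(value)
        self.ids.append(self._id_of[value])

    def value_at(self, row):
        # The stored ID is used directly as an offset into the domain table.
        return self.domain[self.ids[row]]


col = IdColumn()
for city in ["Pune", "Mumbai", "Pune", "Delhi"]:
    col.append(city)
print(col.ids)          # IDs stored in place of the values
print(col.value_at(2))  # value of the third tuple
```

Repeated values cost one small ID each, while the value itself is stored once in the domain table.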
  10. 10. Storage Management <ul><li>Ping Pong Effect </li></ul><ul><ul><li>At the boundaries, ID values are reorganized when the identifier length changes </li></ul></ul><ul><ul><li>Frequent insertions and deletions at the boundaries might result in a lot of reorganization </li></ul></ul><ul><ul><li>This phenomenon should be avoided </li></ul></ul><ul><li>No deletion of Domain values </li></ul><ul><ul><li>Domain structure means a future insertion might reference the deleted value </li></ul></ul><ul><ul><li>Do not delete a domain value even if it is not referenced </li></ul></ul><ul><li>Setting a threshold for deletion of domain values </li></ul><ul><ul><li>Delete only if the number of deletions exceeds a threshold </li></ul></ul><ul><ul><li>Increase the threshold when boundaries are being crossed </li></ul></ul>
  11. 11. Storage Management <ul><li>Primary Key-Foreign Key relationship </li></ul><ul><ul><li>Primary key: A domain in itself </li></ul></ul><ul><ul><li>IDs for primary key values </li></ul></ul><ul><ul><li>Values present in child table are the corresponding primary key IDs </li></ul></ul><ul><ul><li>Projected foreign key column forms a Join Index </li></ul></ul>[Figure: Primary Key-Foreign Key Join Index. The S.B column of child table Relation S holds ID values that are positions of the primary key values v0, v1, …, vn in parent table Relation R.]
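How the projected foreign key column can act as a join index is sketched below. The function name `join_via_fk_ids` and the sample rows are hypothetical; the only assumption taken from the slide is that each child tuple stores the ID (row position) of its parent.

```python
def join_via_fk_ids(parent_rows, child_rows, child_fk_ids):
    """Join child to parent by using each stored foreign-key ID as a parent row position."""
    joined = []
    for row_no, child in enumerate(child_rows):
        # Direct positional lookup: no scan or search of the parent table is needed.
        parent = parent_rows[child_fk_ids[row_no]]
        joined.append(parent + child)
    return joined


# Illustrative data only: two parent rows and three child rows.
doctors = [("D1", "Rao"), ("D2", "Iyer")]
visits = [("visit-a",), ("visit-b",), ("visit-c",)]
visit_doctor_ids = [1, 0, 1]  # projected foreign key column of the child table

for row in join_via_fk_ids(doctors, visits, visit_doctor_ids):
    print(row)
```

Each child tuple is matched with one array access, which is why the slide expects the join between parent and child to be efficient.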
  12. 12. Storage Management <ul><li>ID based Storage wins over Domain Storage when $p > \lceil \log_2 D / 8 \rceil$ </li></ul><ul><li>Relations in a small device do not have a very high cardinality </li></ul><ul><li>The above condition is true for most of the data. </li></ul><ul><li>Advantages </li></ul><ul><li>(i) Considerable saving in storage cost. </li></ul><ul><li>(ii) Efficient join between parent table and child table </li></ul>
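A short worked comparison may help here. It assumes the domain table itself costs the same in both models, so only the per-tuple reference is compared; $N$ (number of tuples) and the values $N = 2000$, $D = 90$ are illustrative assumptions, not figures from the slides.

```latex
\begin{align*}
\text{Domain Storage: } & N \cdot p \\
\text{ID Storage: } & N \cdot \left\lceil \tfrac{\log_2 D}{8} \right\rceil \\
p = 4:\quad & \left\lceil \tfrac{\log_2 D}{8} \right\rceil < 4 \iff D \le 2^{24} = 16{,}777{,}216 \\
N = 2000,\ D = 90:\quad & 2000 \cdot 4 = 8000 \text{ bytes vs. } 2000 \cdot 1 = 2000 \text{ bytes}
\end{align*}
```

With 4-byte pointers, ID Storage is smaller for any domain of up to $2^{24}$ values, which is far beyond the cardinalities expected on a handheld.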
  13. 13. Query Processing <ul><li>Considerations </li></ul><ul><ul><li>Minimize writes to secondary storage </li></ul></ul><ul><ul><li>Efficient usage of limited main memory </li></ul></ul><ul><ul><li>Read buffer not required </li></ul></ul><ul><ul><li>Main memory as write buffer </li></ul></ul><ul><ul><li>If the read:write ratio is very high, flash memory as write buffer </li></ul></ul><ul><li>Need for Left-deep Query Plan </li></ul><ul><ul><li>Reduce materialization; if it is absolutely necessary, use main memory </li></ul></ul><ul><ul><li>Bushy trees and right-deep trees are ruled out </li></ul></ul><ul><ul><li>Left-deep tree is most suited for pipelined evaluation </li></ul></ul><ul><ul><li>Right operand in a left-deep tree is always a stored relation </li></ul></ul>
  14. 14. Query Processing <ul><li>Need for optimal memory allocation </li></ul><ul><ul><li>If nested loop algorithms are used for every operator, the minimum amount of memory is needed to execute the plan </li></ul></ul><ul><ul><li>Nested loop algorithms are inefficient </li></ul></ul><ul><ul><li>Should memory usage be reduced to a minimum at the cost of performance? </li></ul></ul><ul><ul><li>Different devices come with different memory sizes </li></ul></ul><ul><ul><li>Query plans should make efficient use of memory </li></ul></ul><ul><ul><li>Memory must be optimally allocated among all operators </li></ul></ul><ul><li>Need to generate the best query execution plan depending on the available memory </li></ul>
  15. 15. Query Processing <ul><li>Operator evaluation schemes </li></ul><ul><ul><li>Different schemes for an operator </li></ul></ul><ul><ul><li>All have different memory usage and cost </li></ul></ul><ul><ul><li>Schemes conform to left-deep tree query plan </li></ul></ul><ul><ul><li>Cost of a scheme is the computation time </li></ul></ul>
  16. 16. Query Processing <ul><li>Schemes for Join </li></ul><ul><ul><li>Nested Loop Join </li></ul></ul><ul><ul><li>Indexed Nested Loop Join </li></ul></ul><ul><ul><li>Hash Join </li></ul></ul><ul><ul><li>Using Join Index </li></ul></ul><ul><li>Schemes for aggregation </li></ul><ul><ul><li>Nested Loop aggregation </li></ul></ul><ul><ul><li>Buffered aggregation </li></ul></ul><ul><li>Operator schemes implemented using the Iterator Model </li></ul>
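The Iterator Model mentioned above can be illustrated with a pipelined nested loop join. This is a sketch under the usual open/next/close convention; the class names `Scan` and `NestedLoopJoin` and the helper `run` are hypothetical, and relations are plain lists of tuples.

```python
class Scan:
    """Leaf operator: iterates over a stored relation (a list of tuples here)."""

    def __init__(self, rows):
        self.rows = rows

    def open(self):
        self.pos = 0

    def next(self):
        if self.pos >= len(self.rows):
            return None  # end of input
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        pass


class NestedLoopJoin:
    """Pipelined join: the left input is any operator, the right is a stored relation."""

    def __init__(self, left, right_scan, predicate):
        self.left, self.right, self.predicate = left, right_scan, predicate

    def open(self):
        self.left.open()
        self.outer = self.left.next()
        self.right.open()

    def next(self):
        while self.outer is not None:
            inner = self.right.next()
            if inner is None:
                # Right relation exhausted: advance the outer tuple and rescan.
                self.outer = self.left.next()
                self.right.close()
                self.right.open()
                continue
            if self.predicate(self.outer, inner):
                return self.outer + inner
        return None

    def close(self):
        self.left.close()
        self.right.close()


def run(op):
    """Drain an operator tree into a list."""
    op.open()
    out = []
    row = op.next()
    while row is not None:
        out.append(row)
        row = op.next()
    op.close()
    return out


r = Scan([(1, "a"), (2, "b")])
s = Scan([(1, "x"), (2, "y"), (1, "z")])
print(run(NestedLoopJoin(r, s, lambda l, rt: l[0] == rt[0])))
```

Because a `NestedLoopJoin` can itself be the left input of another join, operators chain into a left-deep plan without materializing intermediate results; only one outer tuple is held at a time.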
  17. 17. Query Processing <ul><li>Benefit/Size of a scheme </li></ul><ul><ul><li>Every scheme is characterized by a benefit/size ratio, which represents its benefit per unit of memory allocated </li></ul></ul><ul><ul><li>The minimum scheme for an operator is the scheme that has maximum cost and minimum memory </li></ul></ul><ul><ul><li>Assume n schemes $s_1, s_2, \ldots, s_n$ implement an operator $o$, and let $\min(o) = s_{min}$ be its minimum scheme: </li></ul></ul><ul><ul><li>$\forall i,\ 1 \le i \le n:\ \mathrm{Cost}(s_i) \le \mathrm{Cost}(s_{min}),\ \mathrm{Memory}(s_i) \ge \mathrm{Memory}(s_{min})$ </li></ul></ul><ul><ul><li>Then $\mathrm{Benefit}(s_i) = \mathrm{Cost}(s_{min}) - \mathrm{Cost}(s_i)$ </li></ul></ul><ul><ul><li>$\mathrm{Size}(s_i) = \mathrm{Memory}(s_i) - \mathrm{Memory}(s_{min})$ </li></ul></ul>
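The Benefit and Size definitions translate directly into code. The sketch below is illustrative only: the function name `benefit_and_size`, the scheme names, and the cost and memory numbers are made up for the example.

```python
def benefit_and_size(schemes):
    """schemes maps a scheme name to (cost, memory).

    Returns the minimum scheme's name and a dict name -> (benefit, size).
    """
    # Minimum scheme: least memory, and among those the highest cost.
    s_min = min(schemes, key=lambda s: (schemes[s][1], -schemes[s][0]))
    cost_min, mem_min = schemes[s_min]
    points = {}
    for name, (cost, mem) in schemes.items():
        # Benefit is the cost saved, Size is the extra memory needed.
        points[name] = (cost_min - cost, mem - mem_min)
    return s_min, points


# Hypothetical schemes for one join operator: (cost, memory).
join_schemes = {
    "nested_loop": (1000, 0),
    "indexed_nested_loop": (300, 40),
    "hash_join": (120, 100),
}
s_min, points = benefit_and_size(join_schemes)
for name, (benefit, size) in points.items():
    ratio = benefit / size if size else 0.0
    print(name, benefit, size, ratio)
```

In this example the indexed scheme has the larger benefit/size ratio even though the hash join has the larger absolute benefit, which is the distinction the ratio is meant to capture.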
  18. 18. Query Processing <ul><li>Every operator is a collection of (size, benefit) points, n points for n schemes </li></ul><ul><li>Operator cost function is the collection of (memory, cost) points of its schemes </li></ul>[Figure: (Size, Benefit) points for an operator: (0,0), (s1,b1), (s2,b2), plotted on Size and Benefit axes.] [Figure: Operator cost function: points (0,c1), (m2,c2), (m3,c3), plotted on Memory and Cost axes.]
  19. 19. Query Processing <ul><li>Optimal Memory Allocation </li></ul><ul><li>2-Phase Approach </li></ul><ul><ul><li>Phase 1: Query is first optimized to get a query plan </li></ul></ul><ul><ul><li>Phase 2: Division of memory among the operators </li></ul></ul><ul><ul><li>The scheme for every operator is determined in phase 1 and remains unchanged after phase 2; memory is allocated in phase 2 on the basis of the cost functions of the schemes </li></ul></ul><ul><ul><li>Memory is assumed to be available for all the schemes; this may not be true for a resource constrained device </li></ul></ul><ul><li>Traditional 2-phase optimization cannot be used </li></ul>
  20. 20. Query Processing <ul><li>Optimal Memory Allocation </li></ul><ul><li>1-Phase Approach </li></ul><ul><ul><li>Query optimizer is made memory cognizant </li></ul></ul><ul><ul><li>Modified optimizer takes into account division of memory among operators while choosing between plans </li></ul></ul><ul><li>Ideally, 1-phase optimization should be done, but the optimizer becomes complex. </li></ul>
  21. 21. Query Processing <ul><li>Modified 2-phase optimizer </li></ul><ul><li>Optimal division of memory involves the decision of selecting the best scheme for every operator </li></ul><ul><li>Phase 1: Determine the optimal left-deep join order using a dynamic programming approach </li></ul><ul><li>Phase 2: a) Divide memory among the operators </li></ul><ul><li>b) Choose the scheme for every operator depending on the memory allocated </li></ul>
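Phase 1 can be illustrated with a textbook dynamic program over subsets of relations. The slides do not give the cost model, so this sketch assumes one (the sum of intermediate result sizes); the function name `best_left_deep_order`, the relation names, and the selectivities are all hypothetical.

```python
from itertools import combinations


def best_left_deep_order(cardinality, selectivity):
    """Pick the cheapest left-deep join order by dynamic programming.

    cardinality: relation name -> row count.
    selectivity: frozenset({a, b}) -> join selectivity (1.0 if absent).
    Assumed cost model: sum of the sizes of all intermediate results.
    Returns (cost, result_rows, order).
    """
    rels = sorted(cardinality)
    # best[subset] = (cost, result_rows, order) for the cheapest plan joining subset.
    best = {frozenset([r]): (0.0, float(cardinality[r]), (r,)) for r in rels}
    for k in range(2, len(rels) + 1):
        for subset in map(frozenset, combinations(rels, k)):
            for last in sorted(subset):
                # Left-deep: the right operand is always a single stored relation.
                cost, rows, order = best[subset - {last}]
                out = rows * cardinality[last]
                for r in order:
                    out *= selectivity.get(frozenset((r, last)), 1.0)
                candidate = (cost + out, out, order + (last,))
                if subset not in best or candidate[0] < best[subset][0]:
                    best[subset] = candidate
    return best[frozenset(rels)]


card = {"A": 10, "B": 100, "C": 1000}
sel = {frozenset(("A", "B")): 0.5, frozenset(("B", "C")): 0.125}
print(best_left_deep_order(card, sel))
```

The table holds one entry per subset of relations, so memory grows exponentially with the number of joins; for the three-join queries used in the evaluation this stays small.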
  22. 22. Query Processing <ul><li>Exact memory allocation </li></ul><ul><li>Hulgeri et al. proposed an exact solution to the memory allocation problem </li></ul><ul><ul><ul><li>Traditional 2-phase optimization </li></ul></ul></ul><ul><ul><ul><li>Divides memory among operator schemes; the schemes are selected in phase 1 </li></ul></ul></ul><ul><ul><ul><li>Algorithm to divide memory among piecewise linear cost functions </li></ul></ul></ul><ul><ul><ul><li>Optimal division of memory takes place only at change-over points </li></ul></ul></ul>
  23. 23. Query Processing <ul><li>Exact memory allocation </li></ul><ul><ul><ul><li>Our operator cost functions are also piecewise linear functions </li></ul></ul></ul><ul><ul><ul><li>Exact algorithm can be used by replacing scheme cost function with operator cost function </li></ul></ul></ul><ul><ul><ul><li>Division of memory among operator cost functions </li></ul></ul></ul><ul><ul><ul><li>Amount of memory allocated to each operator will exactly match one of its schemes </li></ul></ul></ul>
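Since each operator ends up with exactly the memory of one of its schemes, the optimum can be checked on small plans by trying every combination. The sketch below is a brute-force baseline for that purpose and is not the algorithm of Hulgeri et al.; the function name `exact_allocation` and the numbers are hypothetical.

```python
from itertools import product


def exact_allocation(operators, total_memory):
    """Exhaustively pick one scheme per operator to minimize total cost.

    operators: one list of (cost, memory) schemes per operator.
    Returns (total_cost, chosen scheme indices), or None if nothing fits.
    """
    best = None
    for choice in product(*(range(len(schemes)) for schemes in operators)):
        memory = sum(operators[i][j][1] for i, j in enumerate(choice))
        if memory > total_memory:
            continue  # this combination does not fit in the device memory
        cost = sum(operators[i][j][0] for i, j in enumerate(choice))
        if best is None or cost < best[0]:
            best = (cost, choice)
    return best


# Two operators with hypothetical (cost, memory) schemes.
ops = [
    [(1000, 0), (300, 40), (120, 100)],
    [(500, 0), (200, 60)],
]
for budget in (60, 100, 200):
    print(budget, exact_allocation(ops, budget))
```

The number of combinations is the product of the scheme counts, so this is only practical as a reference for plans with a handful of operators.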
  24. 24. Query Processing <ul><li>Heuristic memory allocation </li></ul><ul><ul><li>A heuristic determines which operator gains the most per unit of memory allocated, and memory is allocated to that operator </li></ul></ul><ul><ul><li>The gain of every operator is determined by its best feasible scheme </li></ul></ul><ul><ul><li>Repeat the process until memory allocation is done </li></ul></ul><ul><ul><li>Heuristic: select the scheme that has the maximum benefit/size ratio </li></ul></ul>
  25. 25. Query Processing <ul><li>MemAllocate(M_Total) { </li></ul><ul><li>1. M_min = Σ_{i=1..m} Memory(min(i)) </li></ul><ul><li>2. for i=1 to m do </li></ul><ul><li>3. Scheme(i) = min(i) </li></ul><ul><li>4. M_avail = M_Total – M_min </li></ul><ul><li>5. RemoveSchemes(M_avail) </li></ul><ul><li>6. s_best, o_best = GetBestScheme(M_avail) </li></ul><ul><li>7. if no best scheme then return </li></ul><ul><li>8. else { </li></ul><ul><li>9. M_avail = M_avail - Memory(s_best) + Memory(Scheme(o_best)) </li></ul><ul><li>10. Scheme(o_best) = s_best </li></ul><ul><li>11. RemoveSchemes(s_best, o_best, M_avail) </li></ul><ul><li>12. RecomputeBenefits(s_best, o_best) </li></ul><ul><li>13. } </li></ul><ul><li>14. goto step 6 </li></ul><ul><li>} </li></ul>
  26. 26. Query Processing <ul><li>Recomputation of Benefits </li></ul><ul><ul><ul><li>Once the operator o_best gets memory Memory(s_best), the benefit and size of all the schemes of o_best that have higher memory than s_best change. </li></ul></ul></ul><ul><ul><ul><li>New benefit and size values will be the difference between their old values and those of s_best. </li></ul></ul></ul>[Figure: Benefit and Size Recomputation. Points (0,0), (s1,b1), (s2,b2) on Size and Benefit axes. Scheme 1 has the highest benefit/size ratio; once it is chosen, Benefit(Scheme 2) = b2 - b1 and Size(Scheme 2) = s2 - s1.]
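The greedy allocation and the benefit recomputation can be sketched together as follows. This is one reading of the heuristic, not the system's C code: the function name `mem_allocate` and the example numbers are hypothetical, and the sketch assumes that a scheme using more memory always costs less (no dominated schemes).

```python
def mem_allocate(operators, total_memory):
    """Greedy benefit/size allocation.

    operators: one list of (cost, memory) schemes per operator.
    Returns the chosen scheme index per operator, or None if even the
    minimum schemes do not fit in total_memory.
    """
    # Start every operator on its minimum scheme (least memory, highest cost).
    chosen = [min(range(len(s)), key=lambda j: (s[j][1], -s[j][0])) for s in operators]
    avail = total_memory - sum(operators[i][j][1] for i, j in enumerate(chosen))
    if avail < 0:
        return None
    while True:
        best = None  # (ratio, operator index, scheme index)
        for i, schemes in enumerate(operators):
            cur_cost, cur_mem = schemes[chosen[i]]
            for j, (cost, mem) in enumerate(schemes):
                # Benefit and size are measured against the scheme currently
                # assigned, which is the recomputation step after each choice.
                benefit, size = cur_cost - cost, mem - cur_mem
                if benefit <= 0 or size <= 0 or size > avail:
                    continue  # no gain, or does not fit in the memory left
                ratio = benefit / size
                if best is None or ratio > best[0]:
                    best = (ratio, i, j)
        if best is None:
            return chosen  # no feasible scheme improves the plan any further
        _, i, j = best
        avail -= operators[i][j][1] - operators[i][chosen[i]][1]
        chosen[i] = j


ops = [
    [(1000, 0), (300, 40), (120, 100)],
    [(500, 0), (200, 60)],
]
for budget in (60, 100, 200):
    print(budget, mem_allocate(ops, budget))
```

Each round upgrades one operator, so the loop runs at most once per scheme in the plan. Because the choice is greedy, it is not guaranteed to match an exact allocation on every input.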
  27. 27. Some other issues <ul><ul><li>Data Synchronization </li></ul></ul><ul><ul><ul><li>Record the changes in a log </li></ul></ul></ul><ul><ul><ul><li>Merge the changes with the main server </li></ul></ul></ul><ul><ul><ul><li>Conflict detection and Conflict resolution </li></ul></ul></ul><ul><ul><li>Concurrency control </li></ul></ul><ul><ul><ul><li>Local transaction on the device, transaction doing data synchronization </li></ul></ul></ul><ul><ul><ul><li>Minimum concurrency control needed </li></ul></ul></ul><ul><ul><li>Access Rights Management </li></ul></ul><ul><ul><ul><li>Community handhelds like Simputer </li></ul></ul></ul><ul><ul><ul><li>More than a single user </li></ul></ul></ul>
  28. 28. Implementation Status <ul><ul><li>Developed in the C programming language </li></ul></ul><ul><ul><li>Code base distributed over several subdirectories </li></ul></ul><ul><ul><li>Recursive makefiles to build the system </li></ul></ul><ul><ul><li>Lex and Bison used to write the SQL parser </li></ul></ul><ul><ul><li>Storage Manager, Query Optimizer and Query Executor implemented </li></ul></ul><ul><ul><li>Supports CHAR, INTEGER and FLOAT data types </li></ul></ul><ul><ul><li>Supports Select, Project, Join and COUNT operators </li></ul></ul><ul><ul><li>ID based Join Index and other aggregate operators not completed </li></ul></ul>
  29. 29. Performance Evaluation <ul><li>Experimental setup </li></ul><ul><ul><li>Database system ported to the Simputer, a handheld device </li></ul></ul><ul><ul><li>Sample healthcare schema and datasets </li></ul></ul><ul><ul><ul><li>Doctor (91), Drug(77), Visit(830), Prescription(2155) </li></ul></ul></ul><ul><ul><li>Q1: 3 joins and 2 selections </li></ul></ul><ul><ul><li>Q4: 3 joins and aggregation over two attributes </li></ul></ul><ul><ul><li>Data stored in Flat Storage and ID Storage without Join Index </li></ul></ul><ul><ul><li>Exact and heuristic memory allocation </li></ul></ul><ul><ul><li>Response time measured by varying the amount of memory </li></ul></ul>
  30. 30. Performance Evaluation
  31. 31. Performance Evaluation
  32. 32. Performance Evaluation <ul><li>Conclusions </li></ul><ul><ul><li>Response times are highest with minimum memory and lowest with maximum memory </li></ul></ul><ul><ul><li>Computing power of the handheld strongly affects the response time </li></ul></ul><ul><ul><li>Heuristic memory allocation differed from the exact algorithm at only a few points </li></ul></ul><ul><ul><li>Response times are higher for ID Storage due to the extra cost of projection </li></ul></ul><ul><ul><li>Nested loop aggregation is very costly </li></ul></ul><ul><ul><li>Join Index should reduce the query execution time </li></ul></ul>
  33. 33. Summary <ul><li>Storage Manager, Optimizer and Executor implemented </li></ul><ul><li>Supports SPJ and COUNT operators </li></ul><ul><li>Contributions </li></ul><ul><ul><li>A new storage model, ID based Storage </li></ul></ul><ul><ul><li>Highlighted the need for optimal memory allocation </li></ul></ul><ul><ul><li>Existing Exact allocation algorithm used with some modifications </li></ul></ul><ul><ul><li>Heuristic memory allocation algorithm </li></ul></ul><ul><ul><li>Selection of best query execution plan depending on memory available in a device </li></ul></ul>
  34. 34. Ongoing and Future work <ul><li>Ongoing Work </li></ul><ul><ul><li>Data synchronization utility </li></ul></ul><ul><ul><li>Remaining aggregate operators </li></ul></ul><ul><ul><li>ID based Join Index </li></ul></ul><ul><ul><li>Integration with AQUA, an online database backed discussion forum </li></ul></ul><ul><li>Future Work </li></ul><ul><ul><li>Feasibility of a 1-phase optimizer </li></ul></ul><ul><ul><li>DBMS module toolkit </li></ul></ul><ul><ul><li>An operator that returns first-k results of a query </li></ul></ul><ul><ul><li>Application specific DBMS </li></ul></ul>
  35. 35. Thank You