Stage1Ash.ppt
  • Volume is high: Aqua is about 6 MB. Simple query: get the names of pesticides for crop disease x. Complex query: the amount of money obtained by selling the crops.
  • Why not bit identifiers? Storage is byte addressable; packing bit identifiers into bytes increases the storage management complexity.
  • CPU-intensive DBMS tasks, e.g., query optimization.
  • Smart card: 1. Memory card 2. Microprocessor card.
  • Disk locality: the page is copied to another location on the disk, so the principle of storing related data together to reduce disk arm movement (seek time) is violated. Recovery is difficult: redo and undo are required. Cost increases with flash size, as the shadow table entries and the size of the shadow grow.
  • Presumed commit reduces the number of messages exchanged during commit.
  • 1PC: Unilateral Commit for Mobile (UCM). Fewer forced writes: one round only, and no need to write "ready" to the disk. We do not know what protocol is being used in the local DBMS, and we cannot control it even if we know.

Stage1Ash.ppt Presentation Transcript

  • 1. Memory Constrained DBMS with Updates. Ashwini G. Rao. Guide: Prof. Krithi Ramamritham
  • 2. Outline of the talk
    • Need for Handheld DBMS
    • New Issues in Implementation
    • Project Goals
    • Review of Existing Work
    • Compression in Storage
    • Transaction Management
    • Synchronization
    • Current Implementation Status
    • Conclusions and Future work
  • 3. Handhelds
    • Small, Convenient, Carry anywhere
    • Powerful
      • E.g., Simputer: 206 MHz, 32 MB SDRAM, 24 MB flash memory, LCD display, smart card
    • Applications
      • Personal Info Management
        • E-diary
      • Enterprise Applications
        • Health-care, Micro-banking
  • 4. Need for Handheld DBMS
    • Handheld applications
      • Volume of data is high
      • Simple and Complex Queries
        • select, project, aggregate
      • ACID properties of transactions
      • Require Data Privacy
      • Need Synchronization
    • Database management techniques are needed to meet the above requirements
  • 5. New Issues in Implementation
    • Handheld DBMS vs. Disk DBMS
      • Handheld DB is Flash memory based
        • Flash read time is very small
      • Storage model should consider small memory and computation power
      • Transaction management and synchronization have to consider disconnections, mobility and communication cost
      • The handheld operating system provides fewer facilities
        • E.g., no multi-threading support in PalmOS
      • Better security measures are required, as handhelds are easily stolen, damaged, or lost
  • 6. Project Goals
    • Existing work
      • Storage models
      • Query processing & optimization
      • Executor
    • My work
      • Compression in Storage
      • Transaction management
      • Synchronization
  • 7. Existing Work – Review
    • Storage Management
      • Aim at compactness in representation of data
      • Limited storage could preclude any additional index
        • Data model should try to incorporate some index information
    • Query Processing
      • Minimize writes to secondary storage
      • Efficient usage of limited main memory
  • 8. Storage Management
    • Existing storage models
      • Flat Storage
        • Tuples are stored sequentially. Duplicates not eliminated
      • Pointer-based Domain Storage
        • Values partitioned into domains which are sets of unique values
        • Tuples reference the attribute value by means of pointers
        • One domain shared among multiple attributes
  • 9. Storage Management (cont)
    • In Domain Storage, a pointer of size p (typically 4 bytes) points to the domain value. Can we further reduce the storage cost?
    Figure: Flat Storage vs. Domain Storage. In the flat relation every tuple repeats the attribute value (e.g., CSE11); in the domain relation each tuple stores a 4-byte pointer to the single stored copy of the value (CSE11, IIT12).
  • 10. ID Based Storage
    Figure: Relation R stores IDs (0, 1, 2, ..., n) that index positionally into the domain values (v0, v1, ..., vn).
  • 11. ID Based Storage
    • ID Storage
      • An identifier for each of the domain values
      • Store the smaller identifier instead of the pointer
      • Identifier is the positional value in the domain table. Use it as an offset into the domain table
      • D domain values can be distinguished by identifiers of length ⌈log2(D)/8⌉ bytes.
  • 12. ID Storage (cont)
      • Extendable IDs are used. Length of the identifier grows and shrinks depending on the number of domain values
      • Starting with 1 byte identifiers, the length grows and shrinks.
      • To reduce reorganization of data, ID values are projected out from the rest of the relation and stored separately, maintaining positional indexing (see the sketch below).
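    A minimal sketch, in C with invented names, of the ID-based storage described on slides 10-12: the domain table is an array of distinct values, the projected ID column is stored separately, and an ID is simply a value's position in the domain table, used directly as an offset. With at most 256 distinct values a 1-byte ID suffices.

      #include <stdio.h>
      #include <string.h>

      #define MAX_DOMAIN 256      /* <= 256 distinct values, so a 1-byte ID is enough */
      #define MAX_TUPLES 64

      static const char *domain[MAX_DOMAIN];  /* domain table: one slot per distinct value */
      static int domain_size = 0;

      static unsigned char ids[MAX_TUPLES];   /* projected ID column, positionally indexed */
      static int num_tuples = 0;

      /* Return the ID (position) of a value, adding it to the domain if it is new. */
      static int domain_id(const char *value)
      {
          for (int i = 0; i < domain_size; i++)
              if (strcmp(domain[i], value) == 0)
                  return i;
          domain[domain_size] = value;
          return domain_size++;
      }

      /* Insert a tuple: only the 1-byte ID is stored in the relation. */
      static void insert(const char *value)
      {
          ids[num_tuples++] = (unsigned char) domain_id(value);
      }

      /* Fetch the attribute value of tuple t: the ID is an offset into the domain. */
      static const char *fetch(int t)
      {
          return domain[ids[t]];
      }

      int main(void)
      {
          insert("CSE11"); insert("CSE11"); insert("IIT12"); insert("CSE11");
          printf("tuple 2 -> %s, %zu byte(s) stored per reference\n",
                 fetch(2), sizeof ids[0]);
          return 0;
      }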
  • 13. ID Storage (cont)
    • Ping Pong Effect
      • At the boundaries, there is reorganization of ID values when the identifier length changes
      • Frequent insertions and deletions at the boundaries might result in a lot of reorganization
      • This phenomenon should be avoided
    • No deletion of Domain values
      • The domain structure means a future insertion might reference the deleted value
      • Do not delete a domain value even if it is not referenced
    • Setting a threshold for deletion of domain values
      • Delete only if the number of deletions exceeds a threshold
      • Increase the threshold when boundaries are being crossed to reduce the ping-pong effect (see the sketch below)
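    A sketch, under assumed names and an illustrative threshold factor, of the deletion policy above: unreferenced domain values are purged only when their count exceeds a threshold, and the threshold is raised whenever the purge would cross an ID-length boundary (a power of 256), so that repeated insertions and deletions near the boundary do not keep resizing the IDs.

      #include <stdbool.h>
      #include <stdio.h>

      /* Identifier length in bytes needed to distinguish d domain values. */
      static int id_bytes(unsigned d)
      {
          int bytes = 1;
          unsigned long long capacity = 256;
          while (d > capacity) { bytes++; capacity *= 256; }
          return bytes;
      }

      /* True if purging the unreferenced values would change the ID length,
       * i.e. the domain currently sits just above an ID-length boundary.    */
      static bool near_boundary(unsigned domain_size, unsigned unreferenced)
      {
          return id_bytes(domain_size) != id_bytes(domain_size - unreferenced);
      }

      /* Purge unreferenced domain values only when their count exceeds the
       * threshold; a larger threshold near a boundary avoids the ping-pong effect. */
      static bool should_purge(unsigned domain_size, unsigned unreferenced,
                               unsigned threshold)
      {
          if (near_boundary(domain_size, unreferenced))
              threshold *= 4;                 /* illustrative factor, tunable */
          return unreferenced > threshold;
      }

      int main(void)
      {
          printf("%d\n", should_purge(300, 50, 20)); /* would cross the 256 boundary: deferred, prints 0 */
          printf("%d\n", should_purge(300, 30, 20)); /* stays above 256: normal threshold, prints 1      */
          return 0;
      }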
  • 14. ID Storage (cont)
    • Primary Key-Foreign Key relationship
      • Primary key is a domain in itself
      • IDs for primary key values
      • Values present in child table are the corresponding primary key IDs
      • Projected foreign key column forms a Join Index
    Figure: Primary Key-Foreign Key Join Index. The parent table's primary-key values (v0, v1, ..., vn) receive IDs 0, 1, ..., n; the child table (relation R) stores these IDs in its foreign-key column, which acts as the join index.
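    A sketch, with made-up table contents, of how the join index works: the child table's foreign-key column holds parent-key IDs, so joining a child tuple to its parent is a single positional lookup rather than a search.

      #include <stdio.h>

      /* Parent table: the primary key is a domain in itself; position = ID. */
      static const char *parent_key[]  = { "IIT12", "CSE11", "EE07" };
      static const char *parent_name[] = { "Infotech", "Comp Sci", "Elec Engg" };

      /* Child table (relation R): the projected foreign-key column stores
       * parent IDs and acts as the join index.                             */
      static const unsigned char child_fk[]   = { 1, 1, 0, 2 };
      static const char         *child_attr[] = { "p", "q", "r", "s" };

      int main(void)
      {
          /* Join child with parent: each probe is an O(1) positional lookup. */
          for (int t = 0; t < 4; t++)
              printf("%s joins %s (%s)\n", child_attr[t],
                     parent_key[child_fk[t]], parent_name[child_fk[t]]);
          return 0;
      }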
  • 15. ID Storage (cont)
    • ID based Storage wins over Domain Storage when the pointer size > ⌈log2(D)/8⌉ bytes
    • Relations in a small device do not have a very high cardinality, so the above condition holds for most of the data (with a 4-byte pointer, any domain with fewer than 2^24 distinct values qualifies; up to 256 values need only a 1-byte ID)
    • Advantages of ID storage
      • Considerable saving in storage cost.
      • Efficient join between parent table and child table
  • 16. Query Processing
    • Considerations
      • Minimize writes to secondary storage
      • Use Main memory as write buffer
    • Need for Left-deep Query Plan
      • Reduce materialization in flash memory. If absolutely necessary use main memory
      • Bushy trees use materialization
      • Left deep tree is most suited for pipelined evaluation
      • Right operand in a left-deep tree is always a stored relation
  • 17. Query Processing (cont)
    • Need for optimal memory allocation
      • Using nested loop algorithms for every operator ensures that the minimum amount of memory is used to execute the plan
      • Nested loop algorithms are inefficient
      • Different devices come with different memory sizes
      • Query plans should make efficient use of memory. Memory must be optimally allocated among all operators
    • Need to generate the best query execution plan depending on the available memory
  • 18. Query Processing (cont)
    • Operator evaluation schemes
      • Different schemes for an operator
      • Schemes conform to left-deep tree query plan
      • All have different memory usage and cost
      • Cost of a scheme is the computation time
  • 19. Query Processing (cont)
    • 2-Phase optimizer
      • Phase 1: Query is first optimized to get a query plan
      • Phase 2: Division of memory among the operators
      • The scheme for every operator is determined in phase 1 and remains unchanged in phase 2; memory allocation in phase 2 is based on the cost functions of the schemes
      • Memory is assumed to be available for all the schemes; this may not be true for a resource constrained device
    • Traditional 2-phase optimization cannot be used
  • 20. Query Processing (cont)
    • 1-Phase optimizer
      • Query optimizer is made memory cognizant
      • Modified optimizer takes into account division of memory among operators while choosing between plans
      • Ideally, 1-phase optimization should be done, but the optimizer becomes complex.
  • 21. Query Processing (cont)
    • Modified 2-phase optimizer
      • Optimal division of memory involves the decision of selecting the best scheme for every operator
      • Phase 1:
        • Determine the optimal left-deep join order using dynamic programming approach
      • Phase 2:
        • Divide memory among the operators
        • Choose the scheme for every operator depending on the memory allocated
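    A sketch of the phase-2 decision for a single operator, with illustrative schemes and numbers: given the memory granted to the operator, pick the cheapest evaluation scheme that fits in that budget (the benefit/size framework behind the memory division itself is given on slide 52).

      #include <stdio.h>

      /* One evaluation scheme for an operator: its memory need and its cost
       * (computation time). The values below are illustrative only.         */
      struct scheme { const char *name; unsigned memory_kb; double cost_ms; };

      /* Phase 2: among the schemes that fit in the memory allocated to the
       * operator, return the one with the lowest cost.                      */
      static const struct scheme *choose_scheme(const struct scheme *s, int n,
                                                unsigned allocated_kb)
      {
          const struct scheme *best = NULL;
          for (int i = 0; i < n; i++)
              if (s[i].memory_kb <= allocated_kb &&
                  (best == NULL || s[i].cost_ms < best->cost_ms))
                  best = &s[i];
          return best;   /* NULL would mean not even the minimum scheme fits */
      }

      int main(void)
      {
          struct scheme join[] = {
              { "nested-loop",        4, 900.0 },  /* minimum scheme: least memory, highest cost */
              { "block nested-loop", 32, 300.0 },
              { "hash join",        128, 120.0 },
          };
          const struct scheme *pick = choose_scheme(join, 3, 64);
          printf("allocated 64 KB -> %s (%.0f ms)\n", pick->name, pick->cost_ms);
          return 0;
      }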
  • 22. Query Processing (cont)
    • Memory allocation algorithms
      • Exact memory allocation
      • Heuristic memory allocation
    • Conclusions
      • Response times are highest with minimum memory and lowest with maximum memory
      • The computing power of the handheld affects the response time significantly
      • Heuristic memory allocation differed from the exact algorithm at only a few points
  • 23. Compression in DB
    • Advantages
      • Saves space
      • Reduces read time and write time as less data is processed
      • Logging consumes less space and time
    • Disadvantages
      • CPU intensive
      • Competes with other CPU intensive DBMS tasks.
      • May slow down the DBMS
  • 24. Compression in Disk DB
    • Main assumption
      • The high disk read time compensates for the extra time required for compression and decompression
      • E.g. Let time taken to read 10 blocks of data from the disk be 10ms. Let the time taken for compression and decompression be 5ms. After compression 10 blocks occupy only 1 block.
      • Processing time with compression/decompression
        • = (1 ms + 5 ms) = 6 ms, versus 10 ms without compression
    • Handheld DB is Flash memory based
      • Read time is very small, so the above assumption is no longer valid
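    The same arithmetic as a small sketch, so the flash case can be compared directly; the disk numbers are the slide's illustrative ones and the flash read time is an assumed, much smaller value, not a measurement.

      #include <stdio.h>

      int main(void)
      {
          double blocks = 10.0, ratio = 10.0;     /* 10 blocks compress into 1   */
          double codec_ms = 5.0;                  /* compression + decompression */

          double disk_ms_per_block  = 1.0;        /* slide's example: 10 blocks = 10 ms */
          double flash_ms_per_block = 0.1;        /* assumed much smaller for flash     */

          /* Disk: reading 1 block plus the codec beats reading 10 blocks. */
          printf("disk : %.1f ms with compression vs %.1f ms without\n",
                 blocks / ratio * disk_ms_per_block + codec_ms,
                 blocks * disk_ms_per_block);

          /* Flash: reads are so cheap that the codec time dominates. */
          printf("flash: %.1f ms with compression vs %.1f ms without\n",
                 blocks / ratio * flash_ms_per_block + codec_ms,
                 blocks * flash_ms_per_block);
          return 0;
      }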
  • 25. Compression in Handhelds
    • Techniques can exploit high write time of flash memory
    • Logging
      • Compressed records consume less log space
      • Writing time is reduced
      • Decompression done when recovery is initiated
        • Highly beneficial if failures are rare
    • Saves communication cost when log records have to be sent over the network
      • E.g., Transaction management
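    A sketch of compressing a log record before it is written locally or shipped to the fixed server. zlib is used purely as a stand-in for whatever compressor the DBMS would actually use, and the record text is invented for illustration.

      #include <stdio.h>
      #include <string.h>
      #include <zlib.h>   /* link with -lz; stand-in compressor for illustration */

      /* Compress a log record into 'out'; returns the compressed length, or 0
       * on failure. Writing or sending 'out' instead of the raw record saves
       * log space, flash writes and network cost; decompression is needed
       * only if recovery is actually initiated.                              */
      static unsigned long compress_log_record(const char *rec, unsigned char *out,
                                               unsigned long out_cap)
      {
          uLongf dest_len = out_cap;
          if (compress(out, &dest_len, (const Bytef *) rec, strlen(rec) + 1) != Z_OK)
              return 0;
          return dest_len;
      }

      int main(void)
      {
          const char rec[] =
              "UPDATE crop SET pesticide='x' WHERE disease='blight';"
              "UPDATE crop SET pesticide='x' WHERE disease='rust';";
          unsigned char out[256];
          unsigned long n = compress_log_record(rec, out, sizeof out);
          printf("log record: %zu bytes raw, %lu bytes compressed\n",
                 strlen(rec) + 1, n);
          return 0;
      }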
  • 26. Compression in Handhelds (cont)
    • Data compression in Smart cards
      • Consider Handheld with Smart card support
      • Data stored in smart cards is accessed and updated
        • E.g., Personal database
      • Memory in smart cards is limited
      • Compression will save space
      • Data can be decompressed and processed in the handheld
  • 27. Transaction Management
    • Ensure ACID properties of local and global transactions
      • Local transaction - Update address book entry in Simputer
      • Global transaction - Transfer money from a bank account to an epurse in a smart card attached to a Simputer
    • Issues
      • Frequent disconnections, resource constraints, mobility, loss or damage to handheld
  • 28. Transaction Management (cont)
    • We will look into
      • Concurrency control
      • Atomicity
        • Local
        • Global
      • Consistency
      • Durability
  • 29. Concurrency control
    • Concurrency in handhelds depends on
      • Multi-tasking support from the handheld OS
        • E.g., Linux in Simputer, PalmOS
      • User requirements
        • Several tasks may have to execute concurrently
        • E.g., A periodic synchronization task, address book access and an aggregation operation may run concurrently.
    • Strict 2PL, table level locks can be used
      • Small number of concurrent processes
      • Very few data conflicts
      • Table-level locking has a small overhead and allows non-conflicting processes to continue execution (see the sketch below)
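    A sketch of table-level strict 2PL under the assumptions above (a handful of tables, very few concurrent tasks): lock state is one mode word per table, readers share, writers exclude, and locks are released only at commit or abort. The structures and names are illustrative, not the actual implementation.

      #include <stdbool.h>

      #define NUM_TABLES 8
      enum lock_mode { FREE, SHARED, EXCLUSIVE };

      static struct {
          enum lock_mode mode;
          int holders;              /* number of readers when mode == SHARED */
      } lock_table[NUM_TABLES];

      /* Try to acquire a table-level lock; the caller waits and retries on false. */
      static bool lock_acquire(int table, bool exclusive)
      {
          if (lock_table[table].mode == FREE) {
              lock_table[table].mode = exclusive ? EXCLUSIVE : SHARED;
              lock_table[table].holders = 1;
              return true;
          }
          if (!exclusive && lock_table[table].mode == SHARED) {
              lock_table[table].holders++;      /* readers do not conflict */
              return true;
          }
          return false;                         /* conflicting request waits */
      }

      /* Strict 2PL: called per held lock only at commit or abort time. */
      static void lock_release(int table)
      {
          if (lock_table[table].mode == SHARED && --lock_table[table].holders > 0)
              return;
          lock_table[table].mode = FREE;
          lock_table[table].holders = 0;
      }

      int main(void)
      {
          bool r = lock_acquire(0, false);  /* reader locks table 0            */
          bool w = lock_acquire(0, true);   /* writer must wait: returns false */
          lock_release(0);                  /* reader commits, lock is freed   */
          return (r && !w) ? 0 : 1;
      }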
  • 30. Atomicity
    • Ensure the All or nothing property
    • Local atomicity
      • E.g., enter name, email, phone number in the address book of Simputer
      • Shadow based update vs. In place update
    • Global atomicity
      • E.g., In an epurse application the updates are made at the bank's server, the Simputer and the smart card
      • 2PC, optimizations to 2PC, 1PC
  • 31. Local atomicity
    • Shadow based update
      • Advantages
        • No disk locality problem in handheld DB
        • Simplifies recovery
      • Disadvantages
        • Poorly adapted to pointer-based storage models
        • Cost increases with increase in size of flash memory
    • In place update
      • Uses WAL
      • Accommodates Pointer based storage models
      • Cost does not increase with size of flash memory
      • Buffer replacement policy is Steal
        • Dirty blocks can be written to Smart card storage to avoid Undo
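    A sketch of the in-place update path with write-ahead logging, using invented names and a plain file as the log: the undo/redo record is forced out before the bytes are overwritten in place, which is what makes a steal buffer policy safe.

      #include <stdio.h>
      #include <string.h>

      #define PAGE_SIZE 512
      static unsigned char page[PAGE_SIZE];   /* the flash page, simplified to a byte array */

      /* Append an undo/redo record and flush it. A real system would also sync
       * the log to flash (or ship it to a smart card or fixed server) before
       * proceeding; fflush stands in for that here.                           */
      static int log_force(FILE *log, int offset, const void *before,
                           const void *after, int len)
      {
          if (fprintf(log, "off=%d len=%d ", offset, len) < 0)
              return -1;
          fwrite(before, 1, len, log);   /* undo image */
          fwrite(after, 1, len, log);    /* redo image */
          fputc('\n', log);
          return fflush(log);
      }

      /* In-place update: WAL record first, then overwrite the bytes in place. */
      static int update_in_place(FILE *log, int offset, const void *after, int len)
      {
          if (log_force(log, offset, page + offset, after, len) != 0)
              return -1;
          memcpy(page + offset, after, len);
          return 0;
      }

      int main(void)
      {
          FILE *log = fopen("wal.log", "ab");
          if (!log) return 1;
          update_in_place(log, 100, "5551234", 7);
          fclose(log);
          return 0;
      }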
  • 32. Global atomicity
    • Two Phase Commit (2PC)
      • Most commonly used atomic commit protocol
      • Shortcomings in the handheld scenario
        • Two rounds of messages (voting and decision) impose a high communication overhead
        • Requires the handheld to be connected during the voting and decision phases
        • Large number of forced writes
    • Optimizations to 2PC
      • Presumed commit
      • Presumed abort
  • 33. Global atomicity (cont)
    • One Phase Commit (1PC)
      • Advantages
        • Only one round of messages (no voting phase)
        • The handheld can disconnect as soon as the log records are transferred to the fixed server
        • Fewer forced writes
        • Transactions involving the Smart card and the Handheld can use 1PC
      • Disadvantages
        • Requires participants to enforce 2PL. It will work with weak levels of consistency under certain conditions. In a heterogeneous environment it is difficult to control the local DBMS concurrency control policies.
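    A sketch of the message flow seen from the handheld, contrasting 2PC's two rounds with the single round of 1PC; send_to_server and recv_from_server are hypothetical stubs standing in for the real transport, not an actual API.

      #include <stdio.h>
      #include <stdbool.h>

      /* Hypothetical transport stubs; a real system would use the network layer. */
      static void send_to_server(const char *m)   { printf("handheld -> server: %s\n", m); }
      static bool recv_from_server(const char *m) { printf("server -> handheld: %s\n", m); return true; }

      /* 2PC, seen from the handheld participant: two rounds, a forced write of
       * the READY record, and the handheld must stay connected in between.     */
      static void two_phase_commit(void)
      {
          recv_from_server("PREPARE?");
          send_to_server("READY (forced to flash first)");
          recv_from_server("COMMIT");            /* second round: the decision */
          send_to_server("ACK");
      }

      /* 1PC: the handheld ships its log records together with the commit
       * request in one round and may disconnect once the transfer is done.    */
      static void one_phase_commit(void)
      {
          send_to_server("COMMIT + log records");
          recv_from_server("ACK");               /* handheld can disconnect after this */
      }

      int main(void)
      {
          puts("-- 2PC --"); two_phase_commit();
          puts("-- 1PC --"); one_phase_commit();
          return 0;
      }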
  • 34. Consistency and Durability
    • Consistency
      • Local consistency can be ensured by defining integrity constraints
    • Durability
      • Either the changes of the transaction or enough information about the changes are written to stable storage before the transaction commits
      • Network durability: transfer log records to a server on the fixed network.
      • 1PC ensures network durability
      • Pointer based logging
      • Extended ephemeral logging
  • 35. Synchronization
    • Access data Anytime and Anywhere using the handheld
      • Mobile salesperson, wireless warehouse
    • Problem: it is not always possible to remain connected
    • Solution: replicate the data in the handheld
      • Download a copy of the data into the handheld from the remote server and process it offline; periodically merge the changes with the server
  • 36. Synchronization -Issues
    • Data replication can lead to conflicts
      • Update-update, Update-delete, Unique key violation, Integrity constraint violation
    • Maintain global consistency between replicated copies
      • Strict consistency with Data partitioning
      • Strict consistency with Reservation protocols or Leases
        • Efficient when data is rarely shared
      • Weak consistency with Eventual consistency
        • Leases are restrictive when data is shared among many copies
        • Handhelds independently access and update the data
        • Only tentative commits are possible
        • The actual commit happens when the transaction is executed at the server
  • 37. Synchronization – Issues (cont)
    • Application specific conflict detection and resolution
      • Maximum flexibility
    • Device, network and backend agnostic
      • XML, Unicode
    • Incremental maintenance
      • Save communication cost
    • Download parts of relations, i.e., views
  • 38. Synchronization –Existing Models
    • Publish Subscribe Model
      • Three tier
      • Enterprise applications
      • Independent updates
      • Eventual consistency
      • Conflict detection, resolution and merge
    • PC to Handheld Model
      • Two tier
      • Personal information
  • 39. Publish Subscribe Model
    • Eventual consistency model
      • Merge replication in Win SQL CE, Oracle Lite
    • Publish Subscribe Process
      • Publication and article
      • Publishing
      • Subscribing
      • Subscription
      • Synchronization
      • Merging
  • 40. Publish Subscribe Architecture
    • Application
    • SQL DB Engine
    • SQL Database
    • Client Agent
    • Server Agent
    • Merge Agent
      • Conflict Detection
      • Conflict Resolution
    • Replication Provider
    • SQL Server Database
    • Communication Link
  • 41. Conflict Detection and Resolution
    • Conflict detection
      • Row level tracking
      • Associate RowID and Version with each row
      • RowID is used to uniquely identify each row
      • Version is used to check whether a given row has changed in the server
    • Conflict resolution
      • A conflict resolution procedure is invoked when a conflict is detected; the resolution procedure is created when the article is published
      • The output can be "server wins" or "handheld wins"; here the server always wins (see the sketch below)
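    A sketch of the version check during merge, with invented structures, consistent with the row-level tracking example on slides 42 and 43: the handheld uploads its change together with the version it downloaded; a mismatch with the server's current version signals a conflict, and the "server wins" policy keeps the server value.

      #include <stdio.h>
      #include <string.h>

      struct row { int row_id; int version; char value[16]; };

      /* Merge one uploaded change into the server copy. base_version is the
       * version the handheld downloaded; the server wins on conflict.        */
      static void merge(struct row *server, int base_version, const char *new_value)
      {
          if (server->version != base_version) {
              /* Conflict: the row changed on the server since the download. */
              printf("conflict on row %d: server wins (%s)\n",
                     server->row_id, server->value);
              return;
          }
          strcpy(server->value, new_value);  /* no conflict: apply and bump the version */
          server->version++;
      }

      int main(void)
      {
          struct row r = { 0, 0, "CSE" };
          merge(&r, 0, "EE");  /* handheld 1: versions match, CSE -> EE, version becomes 1 */
          merge(&r, 0, "ME");  /* handheld 2: stale version 0, conflict, server wins       */
          printf("final: %s (version %d)\n", r.value, r.version);
          return 0;
      }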
  • 42. Row level tracking
    Figure: Step 1: the server, Handheld 1 and Handheld 2 all hold the row (IIT, CSE) with version 0. Step 2: Handheld 1 changes CSE to EE; Handheld 2 changes CSE to ME.
  • 43. Row level tracking (cont)
    Figure: Step 3: the server merges with Handheld 1; the versions match, so the server and Handheld 1 now hold (IIT, EE) with version 1. Step 4: the server merges with Handheld 2; Handheld 2's change (ME) is based on the old version 0, so a conflict is detected, the server wins, and Handheld 2 also ends with (IIT, EE) at version 1.
  • 44. Current Implementation Status
    • Two Synchronization tools have been implemented for the Simputer
      • First Sync tool assumes that no updates are done in the handheld database
      • Second sync tool is based on Merge replication in Windows SQL CE. It allows independent updates in the handhelds.
  • 45. Conclusions
    • Handheld DBMS techniques have to consider the resource constraints, mobility, frequent disconnections, and security aspects of the handheld
    • The techniques used for one component will influence the choice of the technique used in another component. There is a very strong interdependence between the components of the handheld DBMS
    • Techniques rejected for the disk environment may be explored in the handheld environment
  • 46. Future work
    • Enhance the Sync tool
    • Transaction management component
    • Recovery management component
    • Concurrency control component
    • Performance analysis of existing compression techniques in handheld environment
  • 47. References
  • 48. References (cont)
  • 49. References (cont)
  • 50. References (cont)
  • 51.
    • Thank You
  • 52. Query Processing (cont)
    • Benefit/Size of a scheme
      • Every scheme is characterized by a benefit/size ratio which represents its benefit per unit memory allocation
      • Minimum scheme for an operator is the scheme that has max. cost and min. memory
      • Assume n schemes s_1, s_2, ..., s_n to implement an operator o
      • min(o) = s_min, where for all i, 1 ≤ i ≤ n: Cost(s_i) ≤ Cost(s_min) and Memory(s_i) ≥ Memory(s_min)
      • s_min is the minimum scheme for operator o
      • Benefit(s_i) = Cost(s_min) - Cost(s_i)
      • Size(s_i) = Memory(s_i) - Memory(s_min)
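    A sketch that computes the benefit and size of each scheme relative to the minimum scheme, and the benefit/size ratio used when dividing memory; the scheme numbers are illustrative, not measured costs.

      #include <stdio.h>

      struct scheme { const char *name; double cost; unsigned memory; };

      int main(void)
      {
          /* Illustrative schemes for one operator; s[0] is the minimum scheme
           * (maximum cost, minimum memory).                                   */
          struct scheme s[] = {
              { "nested-loop",       900.0,   4 },
              { "block nested-loop", 300.0,  32 },
              { "hash join",         120.0, 128 },
          };
          const struct scheme *s_min = &s[0];

          for (int i = 1; i < 3; i++) {
              double   benefit = s_min->cost - s[i].cost;     /* Benefit(s_i) = Cost(s_min) - Cost(s_i)  */
              unsigned size    = s[i].memory - s_min->memory; /* Size(s_i) = Memory(s_i) - Memory(s_min) */
              printf("%-18s benefit=%6.1f size=%3u KB benefit/size=%.2f\n",
                     s[i].name, benefit, size, benefit / (double) size);
          }
          return 0;
      }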
  • 53. Query Processing (cont)
    • Every operator is a collection of (size, benefit) points, n points for n schemes
    • Operator cost function is the collection of (cost, memory) points of its schemes
    Figure: (Size, Benefit) points for an operator; the minimum scheme is at (0, 0). Figure: Operator cost function, plotting the (memory, cost) points of the operator's schemes.