ACM SIGMOD 2007, Beijing, China -1 -


Published on

  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

ACM SIGMOD 2007, Beijing, China -1 -

  1. 1. Design of Flash-Based DBMS: An In-Page Logging Approach Bongki Moon Department of Computer Science University of Arizona Tucson, AZ 85721, U.S.A. [email_address] COMPUTER SCIENCE DEPARTMENT Sang-Won Lee School of Info & Comm Eng Sungkyunkwan University Suwon, Korea 440-746 [email_address] SIGMOD’07
  2. 2. Outline <ul><li>Flash memory </li></ul><ul><li>Disk-Based DBMS on Flash Memory </li></ul><ul><li>Flash-Based DBMS: In-Paging Logging approach </li></ul><ul><li>Reviews </li></ul>COMPUTER SCIENCE DEPARTMENT
  3. 3. Flash Memory <ul><li>Flash memory is a type of electrically-erasable programmable read-only memory (EEPROM) </li></ul><ul><li>Page is the unit of read and write operations </li></ul><ul><ul><li>Typical value: 2KB </li></ul></ul><ul><li>Write operation can only clear bits (change their value from 1 to 0). </li></ul><ul><li>The only way to change value from 0 to 1 is erase an entire region memory. </li></ul><ul><ul><li>This region has fixed-size, called erase units, erase block or just block. </li></ul></ul><ul><ul><li>Typical value: 128KB for large flash memory </li></ul></ul>COMPUTER SCIENCE DEPARTMENT
  4. 4. Characteristics of Flash <ul><li>No in-place update </li></ul><ul><ul><li>the data item need to be erased first before writing it again. </li></ul></ul><ul><ul><li>An erase unit (16KB or 128 KB) is much larger than a sector (512 bytes). </li></ul></ul><ul><li>No mechanical latency </li></ul><ul><ul><li>Flash memory is an electronic device without moving parts </li></ul></ul><ul><ul><li>Provides uniform random access speed without seek/rotational latency </li></ul></ul><ul><li>Asymmetric read & write speed </li></ul><ul><ul><li>Read speed is typically at least twice faster than write speed </li></ul></ul><ul><ul><li>Write (and erase) optimization is critical </li></ul></ul>COMPUTER SCIENCE DEPARTMENT
  5. 5. Magnetic Disk vs Flash Memory <ul><ul><li>Magnetic Disk : Seagate Barracuda 7200.7 ST380011A </li></ul></ul><ul><ul><li>NAND Flash : Samsung K9WAG08U1A 16 Gbits SLC NAND </li></ul></ul><ul><ul><li>Unit of read/write: 2KB, Unit of erase: 128KB </li></ul></ul>COMPUTER SCIENCE DEPARTMENT Read time Write time Erase time Magnetic Disk 12.7 msec 13.7 msec N/A NAND Flash 80  sec 200  sec 1.5 msec
  6. 6. Category of Flash Memory <ul><li>NAND Vs. NOR Flash </li></ul><ul><ul><li>NOR: high erase cost (several seconds), directly addressable (access is by bit or byte) </li></ul></ul><ul><ul><li>NAND: relative low erase cost (several ms), access is by pages </li></ul></ul><ul><li>MLC Vs. SLC NAND Flash </li></ul><ul><ul><li>MLC (Multiple Level Cell): it stores multiple bits per cell, but significantly slower read and write speeds; 10x lower read/write lifetime </li></ul></ul><ul><ul><li>SLC (Single Level Cell): it stores only one single bit per cell SLC flash has much better performance, lifetime, and reliability properties than MLC </li></ul></ul>COMPUTER SCIENCE DEPARTMENT
  7. 7. Small flash Vs Large flash <ul><li>Small flash memory was widely used for PDA, MP3, mobile phone, sensor network… </li></ul><ul><ul><li>Advantages: size, weight, shock resistance, power consumption, noise … </li></ul></ul><ul><ul><li>Typical size: a few gigabytes </li></ul></ul><ul><li>Recently, some vendors developed large flash memory called Flash SSD (Solid State Disk) </li></ul><ul><ul><li>Mainly used for notebook PC. Apple AirBook / Thinkpad X300 </li></ul></ul><ul><ul><li>Typical size: > 16G </li></ul></ul>COMPUTER SCIENCE DEPARTMENT
  8. 8. NAND flash system architecture COMPUTER SCIENCE DEPARTMENT <ul><li>Flash Translation Layer (FTL): software layer to make NAND flash fully emulate magnetic disks. </li></ul><ul><ul><li>Logical-to-physical mapping </li></ul></ul><ul><ul><li>Garbage collection </li></ul></ul><ul><ul><li>Power-off recovery </li></ul></ul><ul><ul><li>Wear-leveling </li></ul></ul><ul><ul><li>Bad block management </li></ul></ul><ul><ul><li>Error correction code (ECC) </li></ul></ul><ul><ul><li>Power management </li></ul></ul>
  9. 9. Different FTLs for Large and Small Flash Memory <ul><li>Page-mapping FTL (used for small flash memory) </li></ul><ul><ul><li>maintains the mapping information between the logical page and the physical page separately </li></ul></ul><ul><ul><li>Log-structured achitecture </li></ul></ul><ul><ul><li>Large memory for its mapping information </li></ul></ul><ul><ul><li>must be reconstructed by scanning the whole flash memory at start-up, and this may result in long mount time </li></ul></ul><ul><li>Block-mapping FTL </li></ul><ul><ul><li>Small memory for its mapping information </li></ul></ul><ul><ul><li>Any update causes a whole block rewrite (that is why random writes are so slow!) </li></ul></ul><ul><ul><li>In real production, there are some optimizations for improving concentrated updates </li></ul></ul>COMPUTER SCIENCE DEPARTMENT
  10. 10. Flash memory for server application <ul><ul><li>More recently, because of the advantages of flash memory and the increasing capacity, there is a new trend that use large flash memory for database server application </li></ul></ul><ul><li>Jim Gray said: </li></ul><ul><ul><li>Tape is Dead, Disk is Tape, Flash is Disk! </li></ul></ul>COMPUTER SCIENCE DEPARTMENT
  11. 11. Outline <ul><li>Flash memory </li></ul><ul><li>Disk-Based DBMS on Flash Memory </li></ul><ul><li>Flash-Based DBMS: In-Paging Logging approach </li></ul><ul><li>My reviews </li></ul>COMPUTER SCIENCE DEPARTMENT
  12. 12. Disk-Based DBMS on Flash Memory <ul><li>What happens if disk-based DBMS runs on Flash memory? </li></ul><ul><ul><li>Due to No In-place Update, it writes the whole block into another clean block </li></ul></ul><ul><ul><li>Consume free blocks quickly causing frequent garbage collection and erase </li></ul></ul>COMPUTER SCIENCE DEPARTMENT Flash Memory Page : 4KB SQL: Update / Insert / Delete Buffer Mgr. Data Block Area Dirty Block Write Erase Unit: 128KB Update
  13. 13. Disk-Based DBMS Performance <ul><li>Run SQL queries on a commercial DBMS </li></ul><ul><ul><li>Sequential scan or update of a table </li></ul></ul><ul><ul><li>Non-sequential read or update of a table (via B-tree index) </li></ul></ul><ul><li>Experimental settings </li></ul><ul><ul><li>Storage: Magnetic disk vs M-Tron SSD (Samsung flash chip) </li></ul></ul><ul><ul><li>Data page of 8KB </li></ul></ul><ul><ul><li>10 tuples per page, 640,000 tuples in a table (64,000 pages, 512MB) </li></ul></ul>COMPUTER SCIENCE DEPARTMENT
  14. 14. Disk-Based DBMS Performance <ul><li>Read performance : The result is not surprising at all </li></ul>COMPUTER SCIENCE DEPARTMENT <ul><li>Hard disk </li></ul><ul><ul><li>Read performance is poor for non-sequential accesses, mainly because of seek and rotational latency </li></ul></ul><ul><li>Flash memory </li></ul><ul><ul><li>Read performance is insensitive to access patterns </li></ul></ul>Disk Flash Sequential 14.0 sec 11.0 sec Non-sequential 61.1 ~ 172.0 sec 12.1 ~ 13.1 sec
  15. 15. Disk-Based DBMS Performance <ul><li>Write performance </li></ul>COMPUTER SCIENCE DEPARTMENT <ul><li>Hard disk </li></ul><ul><ul><li>Write performance is poor for non-sequential accesses, mainly because of seek and rotational latency </li></ul></ul><ul><li>Flash memory </li></ul><ul><ul><li>Write performance is poor (worse than disk) for non-sequential accesses due to out-of-place update and erase operations </li></ul></ul><ul><ul><li>Demonstrate the need of write optimization for DBMS running on Flash </li></ul></ul>Disk Flash Sequential 34.0 sec 26.0 sec Non-sequential 151.9 ~ 340.7 sec 61.8 ~ 369.9 sec
  16. 16. Outline <ul><li>Flash memory </li></ul><ul><li>Disk-Based DBMS on Flash Memory </li></ul><ul><li>Flash-Based DBMS: In-Paging Logging approach </li></ul><ul><li>My reviews </li></ul>COMPUTER SCIENCE DEPARTMENT
  17. 17. In-Page Logging (IPL) Approach <ul><li>Design Principles </li></ul><ul><ul><li>Take advantage of the characteristics of flash memory </li></ul></ul><ul><ul><ul><li>Fast read speed </li></ul></ul></ul><ul><ul><li>Overcome the “erase-before-write” limitation of flash memory </li></ul></ul><ul><ul><li>Minimize the changes to the DBMS architecture </li></ul></ul><ul><ul><ul><li>Limited to buffer manager and storage manager </li></ul></ul></ul>COMPUTER SCIENCE DEPARTMENT
  18. 18. Design of the IPL <ul><li>Logging on Per-Page basis in both Memory and Flash </li></ul>COMPUTER SCIENCE DEPARTMENT <ul><li>An In-memory log sector can be associated with a buffer frame in memory </li></ul><ul><ul><li>Allocated on demand when a page becomes dirty </li></ul></ul><ul><li>An In-flash log segment is allocated in each erase unit </li></ul>The log area is shared by all the data pages in an erase unit Flash Memory Database Buffer in-memory data page (8KB) update-in-place in-memory log sector (512B) log area (8KB): 16 sectors Erase unit: 128KB 15 data pages (8KB each) … . … .
  19. 19. IPL Write COMPUTER SCIENCE DEPARTMENT Buffer Mgr. Flash Memory Update / Insert / Delete Data Block Area update-in-place physiological log Page : 8KB Sector : 512B Block : 128KB <ul><li>Whenever an update is performed on a data page, the in-memory copy of data page is updated immediately . In addition, IPL buffer manager adds a log record to the in-memory log sector </li></ul><ul><li>When a dirty page is evicted by replacement policy or the in-memory log sector is full , </li></ul><ul><li>the content of data page is not written to flash memory. </li></ul><ul><li>Instead, In-memory log sector is written to the in-flash log segment </li></ul>
  20. 20. IPL Read <ul><li>When a page is read from flash, the current version is computed on the fly </li></ul>COMPUTER SCIENCE DEPARTMENT Buffer Mgr. Apply the “physiological action” to the copy read from Flash (CPU overhead) Flash Memory <ul><li>Read from Flash </li></ul><ul><li>Original copy of P i </li></ul><ul><li>All log records belonging to P i (IO overhead) </li></ul>Re-construct the current in-memory copy P i log area (8KB): 16 sectors data area (120KB): 15 pages
  21. 21. IPL Merge <ul><li>When all free log sectors in an erase unit are consumed </li></ul><ul><ul><li>Log records are applied to the corresponding data pages </li></ul></ul><ul><ul><li>The current data pages are copied into a new erase unit </li></ul></ul>COMPUTER SCIENCE DEPARTMENT A Physical Flash Block log area (8KB): 16 sectors B old B new clean log area 15 up-to-date data pages Merge
  22. 22. Why IPL can improve write performance of DBMS? <ul><li>The number of disk writes doesn’t decrease </li></ul><ul><ul><li>Actually, #writes may increase because: </li></ul></ul><ul><ul><li>It introduces excess disk writes if the log sector is full </li></ul></ul><ul><ul><li>The merge operation introduces overhead </li></ul></ul><ul><li>Then why can IPL improve write performance? </li></ul><ul><ul><li>IPL overcomes the erase-before-write property of flash </li></ul></ul><ul><ul><li>Reduces the number of erasures </li></ul></ul>COMPUTER SCIENCE DEPARTMENT
  23. 23. IPL Simulation with TPC-C <ul><li>TPC-C Log Data Generation </li></ul><ul><ul><li>Run a commercial DBMS to generate reference streams of TPC-C benchmark </li></ul></ul><ul><ul><ul><li>HammerOra utility used for TPC-C workload generation </li></ul></ul></ul><ul><ul><li>Each trace contains log records of physiological updates as well as physical page writes </li></ul></ul><ul><ul><li>Average length of a log record: 20 ~ 50B </li></ul></ul><ul><li>TPC-C Traces </li></ul><ul><ul><li>100M.20M.10u: 100MB DB, 20 MB buffer, 10 simulated users </li></ul></ul><ul><ul><li>1G.20M.100u: 1GB DB, 20 MB buffer, 100 simulated users </li></ul></ul><ul><ul><li>1G.40M.100u: 1GB DB, 40 MB buffer, 100 simulated users </li></ul></ul><ul><li>Parameter setting </li></ul><ul><ul><li>Write (2KB): 200 us </li></ul></ul><ul><ul><li>Merge (128KB): 20 ms </li></ul></ul>COMPUTER SCIENCE DEPARTMENT
  24. 24. Log Segment Size vs Merges <ul><li>TPC-C Write frequencies are highly skewed (and low temporal locality) </li></ul><ul><li>Erase units containing hot pages consume log sectors quickly </li></ul><ul><ul><li>Could cause a large number of erase operations </li></ul></ul><ul><ul><li>More storage but less frequent merges with more log sectors </li></ul></ul>COMPUTER SCIENCE DEPARTMENT
  25. 25. Estimated Write Performance <ul><li>Performance trend with varying buffer sizes </li></ul><ul><ul><li>The size of log segment was fixed at 8KB </li></ul></ul><ul><li>Estimated write time </li></ul><ul><ul><li>With IPL = (# of sector writes) × 200us + (# of merges) × 20ms </li></ul></ul><ul><ul><li>Without IPL =  × (# of page writes) × 20ms </li></ul></ul><ul><ul><ul><li> is the probability that a page write causes erase operation </li></ul></ul></ul>COMPUTER SCIENCE DEPARTMENT
  26. 26. Support for Recovery <ul><li>IPL helps realize a lean recovery mechanism </li></ul><ul><ul><li>Additional logging: transaction log and list of dirty pages </li></ul></ul><ul><li>Transaction Commit </li></ul><ul><ul><li>Similarly to flushing log tail </li></ul></ul><ul><ul><li>An in-memory log sector is forced out to flash if it contains at least one log record of a committing transaction </li></ul></ul><ul><ul><li>No explicit REDO action required at system restart </li></ul></ul><ul><li>Transaction Abort </li></ul><ul><ul><li>De-apply the log records of an aborting transaction </li></ul></ul><ul><ul><li>Use selective merge instead of regular merge, because it’s irreversible </li></ul></ul><ul><ul><ul><li>If committed, merge the log record </li></ul></ul></ul><ul><ul><ul><li>If aborted, discard the log record </li></ul></ul></ul><ul><ul><ul><li>If active, carry over the log record to a new erase unit </li></ul></ul></ul><ul><ul><li>To avoid a thrashing behavior, allow an erase unit to have overflow log sectors </li></ul></ul><ul><ul><li>No explicit UNDO action required </li></ul></ul>COMPUTER SCIENCE DEPARTMENT
  27. 27. Conclusion <ul><li>Clear and present evidence that Flash can replace Disk </li></ul><ul><li>IPL approach demonstrates its potential for TPC-C type database applications by </li></ul><ul><ul><li>Overcoming the “erase-before-write” limitation </li></ul></ul><ul><ul><li>Exploiting the fast and uniform random access </li></ul></ul><ul><li>IPL also helps realize a lean recovery mechanism </li></ul>COMPUTER SCIENCE DEPARTMENT
  28. 28. Outline <ul><li>Flash memory </li></ul><ul><li>Disk-Based DBMS on Flash Memory </li></ul><ul><li>Flash-Based DBMS: In-Paging Logging approach </li></ul><ul><li>Reviews </li></ul>COMPUTER SCIENCE DEPARTMENT
  29. 29. Reviews <ul><li>IPL hurts read performance </li></ul><ul><ul><li>For each read operation, it has to read data page and log sector page </li></ul></ul><ul><ul><li>Read performance will be about 2X slower </li></ul></ul><ul><li>No Experiment Result </li></ul><ul><ul><li>The authors only give the result through the I/O access simulation </li></ul></ul><ul><li>Simulation </li></ul><ul><ul><li>The data size of simulation is too small (1G). </li></ul></ul><ul><ul><li>Didn’t show the overall performance of TPC-C. (most operations in TPC-C are read operations) </li></ul></ul>COMPUTER SCIENCE DEPARTMENT
  30. 30. Any Questions? <ul><li>Q & A </li></ul>COMPUTER SCIENCE DEPARTMENT