Optimizing Oracle databases with SSD - April 2014

Presentation on using Solid State Disk (SSD) with Oracle databases, including the 11gR2 DB flash cache and using flash in Exadata. Last given at Collaborate 2014 #clv14.

  1. REMINDER: Check in on the COLLABORATE mobile app. Session 206: Using Flash SSD to Optimize Oracle Database Performance. Guy Harrison, Executive Director, R&D, Information Management Group, Dell Software
  2. Agenda • Brief history of magnetic disk • Solid State Disk (SSD) technologies • SSD internals • Oracle DB flash cache architecture • Performance comparisons • Exadata flash • Recommendations and suggestions
  3. Introductions • Web: guyharrison.net • Email: guy.harrison@software.dell.com • Twitter: @guyharrison • Google Plus: https://www.google.com/+GuyHarrison1
  4.–8. [Image-only slides: no text content]
  9. A brief history of disk
  10. Magnetic disk architecture
  11. 5 MB HDD, circa 1956
  12. 28 MB HDD, 1961 • 1,800 RPM • 100,000 times smaller than a cheap 3 TB drive • BUT spinning only about 10 times slower than that drive
  13. The more that things change....
  14. Moore's law
  15. Moore's law • Transistor density doubles every 18 months • Exponential growth is observed in most electronic components: CPU clock speeds, RAM, hard disk drive storage density • But not in mechanical components: service time (seek latency) is limited by actuator arm speed and disk circumference; throughput (rotational latency) is limited by speed of rotation, circumference and data density
  16. Disk trends 2001-2009 [chart, percentage change: IO rate +260%, disk capacity +1,635%, IO/capacity -630%, CPU +1,013%, IO/CPU -390%]
  17. Solid State Disk to the rescue?
  18. Seek times [chart, microseconds: magnetic disk 4,000; SATA flash SSD 80; PCI flash SSD 25; DDR-RAM SSD 15]
  19. Economics of SSD [chart: dollars per GB and dollars per IOP for capacity HDD, performance HDD, SATA SSD, MLC PCI SSD and SLC PCI SSD; HDD remains cheaper per GB while SSD is cheaper per IOP]
  20. Tiered storage management [diagram: tiers from main memory, DDR SSD, flash SSD and fast disk (SAS, RAID 0+1) down to slow disk (SATA, RAID 5) and tape, flat files and Hadoop, trading off $/IOP against $/GB]
  21. SSD technology and internals
  22. Flavours of Solid State Disk • DDR RAM drive • SATA flash drive • PCI flash drive • SSD storage server
  23. PCI SSD vs SATA SSD • SATA was designed for traditional disk drives with high latencies • PCI is designed for high-speed devices • PCI SSD latency is roughly one third that of SATA SSD
  24. Dell Express Flash • PCI flash performance can normally only be achieved by attaching a PCI card directly to the server motherboard • Dell Express Flash exposes the PCI bus interface through front-loading drive slots, allowing hot swap and install of PCI flash
  25. Flash SSD is the most cost-effective SSD technology
  26. Flash SSD internals • Storage hierarchy: a cell holds one (SLC), two (MLC) or three (TLC) bits; a page is typically 4-8K; a block is typically 128-512K (up to 1M) • Writes: reads and first writes require only a single-page IO, but overwriting a page requires an erase and rewrite of the whole block • Write endurance: 100,000 erase cycles for SLC before failure; 5,000-15,000 erase cycles for MLC
  27. Flash SSD performance [chart, microseconds: read (4K page seek) 25; first insert (4K page write) 250; update (256K block erase) 2,000]
  28. Flash disk write degradation • All blocks empty: write time = 250 µs • 25% full: write time = (¾ × 250 µs + ¼ × 2,000 µs) = 687 µs • 75% full: write time = (¼ × 250 µs + ¾ × 2,000 µs) = 1,562 µs [diagram: empty vs partially full SSD]
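The arithmetic above generalizes to a simple expected-value formula. Assuming, as the slide does, that a write landing on an already-used block always incurs a block erase, and writing f for the fraction of blocks that already hold data:

$$E[t_{\text{write}}] \approx (1-f)\,t_{\text{page}} + f\,t_{\text{erase}} = (1-f)\times 250\,\mu s + f\times 2000\,\mu s$$

With f = 0.25 this gives about 687 µs, and with f = 0.75 about 1,562 µs, matching the figures on the slide.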
  29. [Diagram: SSD controller handling an insert, showing the free block pool, used block pool and valid/empty/invalid data pages]
  30. [Diagram: SSD controller handling an update, showing the free block pool, used block pool and valid/empty/invalid data pages]
  31. [Diagram: SSD controller garbage collection, showing the free block pool, used block pool and valid/empty/invalid data pages]
  32. [Image-only slide: no text content]
  33. Oracle Database Flash Cache
  34. Oracle DB flash cache • Introduced in 11gR2, for OEL and Solaris only • Secondary cache maintained by the DBWR, but only when idle cycles permit • The architecture is tolerant of poor flash write performance
  35. Buffer cache and free buffer waits [diagram: the Oracle process reads from disk and from the buffer cache while the DBWR writes dirty blocks to the database files; free buffer waits often occur when reads are much faster than writes]
  36. DB flash cache architecture [diagram: as above, but the DBWR also writes clean blocks to the flash cache, time permitting, and the Oracle process can read from the flash cache; the DB flash cache architecture is designed to accelerate buffered reads]
  37. Configuration • Create a filesystem on the flash device • Set DB_FLASH_CACHE_FILE and DB_FLASH_CACHE_SIZE • Consider FILESYSTEMIO_OPTIONS=SETALL (see the sketch below)
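A minimal sketch of those settings, assuming the flash filesystem is mounted at /flash (the path and sizes are illustrative, not values from the slides):

```sql
-- Enable the 11gR2 DB flash cache on a flash-backed filesystem.
ALTER SYSTEM SET db_flash_cache_file = '/flash/oracle_flash_cache.dat' SCOPE = SPFILE;
ALTER SYSTEM SET db_flash_cache_size = 100G SCOPE = SPFILE;
-- Asynchronous and direct IO for filesystem-based datafiles and the flash cache file:
ALTER SYSTEM SET filesystemio_options = 'SETALL' SCOPE = SPFILE;
-- An instance restart is needed before the flash cache becomes active.
```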
  38. Flash KEEP pool • You can prioritise blocks for important objects using the FLASH_CACHE storage clause, for example:
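A brief example of the clause (the object names are illustrative):

```sql
ALTER TABLE sales     STORAGE (FLASH_CACHE KEEP);   -- keep this table's blocks in the flash cache
ALTER INDEX sales_pk  STORAGE (FLASH_CACHE KEEP);
ALTER TABLE audit_log STORAGE (FLASH_CACHE NONE);   -- never cache this table in flash
```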
  39. Oracle DB flash cache statistics: http://guyharrison.squarespace.com/storage/flash_insert_stats.sql
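The linked script is not reproduced on the slide; a rough sketch of the same idea, assuming the flash cache counters exposed in V$SYSSTAT (statistic names beginning with 'flash cache'):

```sql
SELECT name, value
FROM   v$sysstat
WHERE  name LIKE 'flash cache%'
ORDER  BY value DESC;
```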
  40. Flash cache efficiency: http://guyharrison.squarespace.com/storage/flash_time_savings.sql
  41. Flash cache contents: http://guyharrison.squarespace.com/storage/flashContents.sql
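Again, only the link is shown; a sketch of the idea, assuming flash-resident buffers are reported in V$BH with a status beginning with 'flash':

```sql
SELECT o.owner, o.object_name, COUNT(*) AS flash_cached_blocks
FROM   v$bh b
       JOIN dba_objects o ON o.data_object_id = b.objd
WHERE  b.status LIKE 'flash%'
GROUP  BY o.owner, o.object_name
ORDER  BY flash_cached_blocks DESC;
```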
  42. Performance tests
  43. Test systems • First system: Dell Optiplex, dual-core, 4 GB RAM; 2 x Seagate 7,500 RPM Barracuda SATA HDD; Intel X-25E SLC SATA SSD • Second system: Dell R510, 2 x quad-core, 32 GB RAM; 4 x 300 GB 15K RPM 6 Gbps Dell SAS HDD; 1 x FusionIO ioDrive SLC PCI SSD • Third system: Oracle Exadata X-2 quarter rack; 36 x 600 GB 15K RPM SAS HDD; 12 x 96 GB Sun F20 SLC PCI flash cards • Final system: Dell R720, 2 x 8-core 2.7 GHz processors, 64 GB RAM; 16 x 15K RPM HDD in RAID 10; 1 x Dell Express Flash SLC PCIe SSD
  44. Performance: indexed reads (X-25) [chart, elapsed seconds: no flash 529.7; flash cache 143.3; flash tablespace 48.2; broken down into CPU, db file IO, flash cache IO and other]
  45. Performance: read/write (X-25) [chart, elapsed seconds: no flash 3,289; flash cache 1,693; flash tablespace 200; broken down into CPU, db file IO, write complete/free buffer waits, flash cache IO and other]
  46. Random reads – FusionIO [chart, elapsed seconds: SAS disk with no flash cache 2,211; SAS disk with flash cache 583; table on SSD 121; broken down into CPU, other, DB file IO and flash cache IO]
  47. Updates – FusionIO [chart, elapsed seconds: SAS disk with no flash cache 6,219; SAS disk with flash cache 1,934; table on SSD 529; broken down into DB CPU, db file IO, log file IO, flash cache, free buffer waits and other]
  48. Buffer cache bottlenecks • The flash cache architecture avoids 'free buffer waits' caused by waiting on flash IO, but write complete waits can still occur on hot blocks • Free buffer waits are still possible against the database files, because the flash cache accelerates reads but not writes
  49. Full table scans [chart, elapsed seconds: SAS disk with no flash cache 418; SAS disk with flash cache 398; table on SSD 72]. The flash cache doesn't accelerate full table scans because scans use direct path reads, and the flash cache only accelerates buffered reads.
  50. Sorting – what we expect [chart: time vs PGA memory available (MB), showing table/index IO, CPU time and temp segment IO across the multi-pass disk sort, single-pass disk sort and in-memory sort ranges]
  51. Disk sorts – temp tablespace SSD vs HDD [chart: elapsed time (s) vs sort area size for a SAS-based vs an SSD-based temporary tablespace, across the single-pass and multi-pass disk sort ranges; the SSD-based temp tablespace helps most for multi-pass sorts]
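One way to act on this result is to place a temporary tablespace on the flash device; a hedged sketch (the file name, size and user are illustrative assumptions):

```sql
CREATE TEMPORARY TABLESPACE temp_ssd
  TEMPFILE '/flash/temp_ssd01.dbf' SIZE 32G AUTOEXTEND ON;
-- Point sort- and hash-join-heavy users at it, or make it the database default:
ALTER USER batch_user TEMPORARY TABLESPACE temp_ssd;
ALTER DATABASE DEFAULT TEMPORARY TABLESPACE temp_ssd;
```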
  52. SSD for redo?
  53. Redo performance – FusionIO [chart, elapsed seconds: SAS-based redo log 292.4; flash-based redo log 291.9; split into CPU and log IO]
  54. Concurrent redo workload (x10) [chart: elapsed time (s) for a SAS-based vs a flash-based redo log, split into CPU, other and log file IO, with only a small difference between the two]
  55. Redo logs - redo size • Marcelle Kratochvil has reported significant improvements for SSD redo when applying LOB updates • Performance for SSD writing small OLTP-style transactions may differ significantly from large LOB updates: – Small transactions will hit the same block repeatedly, resulting in block erase overheads for most writes – When the redo size exceeds the SSD page size this overhead is avoided, and redo performance on SSD may exceed HDD – On the other hand, "in foreground garbage collection a larger write will require more pages to be erased, so actually will suffer from even more performance issues" (flashdba)
  56. Redo performance – Express Flash [chart: elapsed time vs redo size (MB, millions) for HDD and SSD, with regions marked where a block erase is and is not required]
  57. Conclusions for redo • SSD is not a good match for redo – Sustained sequential writes lead to heavy garbage collection overhead – Magnetic disk is very good at sequential writes because seek time is minimized • A very good SSD might provide (very roughly) a 20-30% reduction in redo log sync waits – At least, that is the best I have seen – Might provide no benefit at all on a busy system – Might provide higher benefits on a lightly burdened system • Very eager to compare data with anyone who has different results
  58. Device-level SSD caches
  59. Flash caching technologies (Dell FluidCache, FusionIO DirectCache, etc.) [diagram: a caching block device driver sits between the file system/raw devices/ASM and the LUN; suited to read-intensive, potentially massive tablespaces such as temp tablespaces, hot segments and hot partitions, whereas the DB flash cache is limited to the size of the SSD]
  60. FusionIO directCache – table scans [chart, elapsed seconds: no cache, 1st scan 147; no cache, 2nd scan 147; directCache on, 1st scan 147; directCache on, 2nd scan 36; split into CPU, IO and other]
  61. Exadata
  62. Exadata X-4
  63. Exadata flash storage • 4 x 96 GB PCI flash drives on each storage server (a 4x increase in X3) • Flash can be configured as: – Exadata Smart Flash Cache (ESFC) – Solid State Disk available to ASM disk groups • ESFC is not the same as the DB flash cache: – Maintained by cellsrv, not the DBWR – Supports smart scans and full scans (if CELL_FLASH_CACHE=KEEP) – Statistics are accessed via the cellcli program • Considerations for cache vs. SSD are similar
  64. Exadata Smart Flash Cache architecture [diagram: Oracle processes and buffer caches on the database nodes; cellsrv on the storage node services read requests from the flash cache or the grid disks]
  65. CELL_FLASH_CACHE KEEP • CELL_FLASH_CACHE applies at the segment (table, index, partition) level • The default setting caches index lookup results; smart scans and (non-smart) full table scans are only cached when the KEEP option is applied • Summary by setting: NONE – nothing is cached; DEFAULT – index lookups are cached, smart scans and full table scans are not; KEEP – index lookups, smart scans and full table scans are all cached
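The option is applied in a segment's storage clause; a brief example (object names are illustrative):

```sql
ALTER TABLE sales        STORAGE (CELL_FLASH_CACHE KEEP);   -- cache scans of this table as well
ALTER TABLE staging_data STORAGE (CELL_FLASH_CACHE NONE);   -- never cache this table
```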
  66. Using Exadata flash as grid disk • By default Exadata uses all flash disks as flash cache • You can modify this configuration and assign some flash disks as grid disks [diagram: the default layout, with all flash serving as flash cache over the SAS cell disks, versus a modified layout in which some flash disks are presented as grid disks backing an additional ASM disk group]
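The reconfiguration is done with cellcli on each storage cell and then in ASM; a rough outline only (the cache size, grid disk prefix and ASM disk string are illustrative assumptions, and the exact steps should be checked against the Exadata storage software documentation for your release):

```
CellCLI> drop flashcache
CellCLI> create flashcache all size=200g
CellCLI> create griddisk all flashdisk prefix=FLASH

SQL> CREATE DISKGROUP flash_dg NORMAL REDUNDANCY DISK 'o/*/FLASH*';
```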
  67. Index reads [chart, seconds: SSD tablespace with no cache 9.4; HDD tablespace with default cache 21.6; HDD tablespace with no cache 31.7; split into CPU time and IO time]
  68. Full table scans [chart, seconds: 1st and 2nd scan times for an SSD table with default cache, an HDD table with keep cache and an HDD table with default cache; charted values 2.94, 4.75, 11.27, 3.36, 33.14 and 12.45 s]. Beware of CELL_FLASH_CACHE=KEEP.
  69. Exadata: SSD for redo
  70. [Image-only slide: no text content]
  71. Note: these are Exadata X-2 performance figures; X-3 and X-4 are probably much faster
  72. Exadata Smart Flash Log
  73. Smart Flash Log • Designed to reduce "outlier" redo log sync waits • Redo is written simultaneously to disk and flash • The first write to complete wins • Introduced in Exadata storage software 11.2.2.4 [diagram: the LGWR writes redo from the log buffer to the storage node, where cellsrv writes it to both flash and the grid disks]
  74. All redo log writes (16M log writes), wait times in microseconds: Flash Log ON – min 1.0, median 650, mean 723, 99th percentile 1,656, max 75,740; Flash Log OFF – min 1.0, median 627, mean 878, 99th percentile 4,662, max 291,800
  75. Redo log outliers (log file sync wait trace; note the 291,780 µs outlier):
WAIT #47124064145648: nam='log file sync' ela= 710 buffer#=129938 sync scn=1266588258 p3=0 obj#=-1 tim=1347583167579790
WAIT #47124064145648: nam='log file sync' ela= 733 buffer#=130039 sync scn=1266588297 p3=0 obj#=-1 tim=1347583167580808
WAIT #47124064145648: nam='log file sync' ela= 621 buffer#=130124 sync scn=1266588332 p3=0 obj#=-1 tim=1347583167581695
WAIT #47124064145648: nam='log file sync' ela= 507 buffer#=130231 sync scn=1266588371 p3=0 obj#=-1 tim=1347583167582486
WAIT #47124064145648: nam='log file sync' ela= 683 buffer#=101549 sync scn=1266588404 p3=0 obj#=-1 tim=1347583167583398
WAIT #47124064145648: nam='log file sync' ela= 2084 buffer#=130410 sync scn=1266588442 p3=0 obj#=-1 tim=1347583167585748
WAIT #47124064145648: nam='log file sync' ela= 798 buffer#=130535 sync scn=1266588488 p3=0 obj#=-1 tim=1347583167586864
WAIT #47124064145648: nam='log file sync' ela= 1043 buffer#=101808 sync scn=1266588527 p3=0 obj#=-1 tim=1347583167588250
WAIT #47124064145648: nam='log file sync' ela= 2394 buffer#=130714 sync scn=1266588560 p3=0 obj#=-1 tim=1347583167590888
WAIT #47124064145648: nam='log file sync' ela= 932 buffer#=101989 sync scn=1266588598 p3=0 obj#=-1 tim=1347583167592057
WAIT #47124064145648: nam='log file sync' ela= 291780 buffer#=102074 sync scn=1266588637 p3=0 obj#=-1 tim=1347583167884090
WAIT #47124064145648: nam='log file sync' ela= 671 buffer#=102196 sync scn=1266588697 p3=0 obj#=-1 tim=1347583167885294
WAIT #47124064145648: nam='log file sync' ela= 957 buffer#=102294 sync scn=1266588730 p3=0 obj#=-1 tim=1347583167886575
WAIT #47124064145648: nam='log file sync' ela= 852 buffer#=120 sync scn=1266588778 p3=0 obj#=-1 tim=1347583167887763
WAIT #47124064145648: nam='log file sync' ela= 639 buffer#=214 sync scn=1266588826 p3=0 obj#=-1 tim=1347583167888778
WAIT #47124064145648: nam='log file sync' ela= 699 buffer#=300 sync scn=1266588853 p3=0 obj#=-1 tim=1347583167889767
WAIT #47124064145648: nam='log file sync' ela= 819 buffer#=102647 sync scn=1266588886 p3=0 obj#=-1 tim=1347583167890829
  76. Top 10,000 waits
  77. Exadata 12c Smart Flash Cache write-back • Database writes go to the flash cache – LRU aging to HDD – Reads are serviced from flash prior to age-out – Similar restrictions to the flash cache (smart scans, etc.) – Will be most effective when "buffer waits" exist – Random IO writes are less problematic for flash than sequential writes
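For reference, the write-back mode is switched per storage cell with cellcli, roughly along the following lines; on some storage software versions the flash cache must also be dropped and recreated around the change, so treat this as an outline rather than a procedure:

```
CellCLI> alter cell shutdown services cellsrv
CellCLI> alter cell flashCacheMode = WriteBack
CellCLI> alter cell startup services cellsrv
```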
  78. Performance tests [chart, seconds by FlashCacheMode: WriteBack 1,917; WriteThrough 7,694; split into CPU time, other wait time, free buffer waits and buffer busy waits]
  79. Summary
  80. Recommendations • Don't wait for SSD to become as cheap as HDD – Magnetic HDD will always be cheaper per GB, SSD cheaper per IO • Consider a mixed or tiered storage strategy – Use the DB flash cache, or selective SSD tablespaces or partitions – Put SSD where your IO bottleneck is greatest and the SSD advantage is significant • The DB flash cache offers an easy way to leverage SSD for OLTP workloads, but has few advantages for OLAP or data warehouse workloads
  81. How to use SSD • Database flash cache – if your bottleneck is single-block (indexed) reads and you are on 11gR2 on OEL or Solaris • Flash tablespace – to optimize reads and writes against "hot" segments or partitions • Flash temp tablespace – if multi-pass disk sorts or hash joins are your bottleneck • Device cache (Dell FluidCache, FusionIO directCache) – if you want to optimize both scans and index reads, or you are not on OEL/Solaris 11gR2 • Exadata uses flash effectively for read AND write optimization – consider allocating some Exadata flash to an ASM disk group for tablespaces holding hot tables and segments
  82. Visit the Dell Software booth. Enter for a chance to win a Dell Venue Pro 11 tablet. The draw is at 2:45pm Thursday.
  83. Please complete the session evaluation on the mobile app. We appreciate your feedback and insight. guy.harrison@software.dell.com • @guyharrison • guyharrison.net
