An Overview of Flash Storage for Databases

  1. An Overview of Flash Storage for Databases. Morgan Tocker <morgan@percona.com>. Wednesday, March 9, 2011
  2. Introduction ★ [Me] Director of Training. Previously worked at MySQL, Sun Microsystems. ★ [Percona] Consulting, Training, Support & Development for MySQL. ★ No vested interest in which hardware I recommend. ✦ [Disclaimer] Some hardware vendors have engaged our services to evaluate and improve the performance of their products.
  3. What this talk is about ★ Flash technologies (NAND, NOR). ★ Server usage. ✦ Not USB thumb drives. ✦ Not consumer usage. ★ “For Databases” == MySQL. ✦ Should be more or less applicable to all databases.
  4. Agenda ★ Introduction. ★ A look at the current market. ★ Applications.
  5. Revolutionary ★ Change in technology: ✦ From spinning disk to solid state. ★ No mechanical moving parts. ★ Jump in performance. ★ Requires changes in the application. ★ Hard not to predict SSDs replacing nearly all hard disks in the next 5-10 years.* ✦ * However, at the moment hard disks are still becoming cheaper (per unit of capacity) more quickly than SSDs!
  6. “Numbers everyone should know”
       L1 cache reference                              0.5 ns
       Branch mispredict                                 5 ns
       L2 cache reference                                7 ns
       Mutex lock/unlock                                25 ns
       Main memory reference                           100 ns
       Compress 1K bytes with Zippy                  3,000 ns
       Send 2K bytes over 1 Gbps network            20,000 ns
       NAND Flash (my estimate)                     50,000 ns
       Read 1 MB sequentially from memory          250,000 ns
       Round trip within same datacenter           500,000 ns
       Disk seek                                10,000,000 ns
       Read 1 MB sequentially from disk         20,000,000 ns
       Send packet CA->Netherlands->CA         150,000,000 ns
     See: http://www.linux-mag.com/cache/7589/1.html and Google: http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
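
To put the table in proportion, here is a small sketch that turns three of its rows into ratios and into the number of operations a single thread could complete per second if it waited for each one in turn. The latencies are copied from the slide (the NAND figure is the speaker's own estimate); the dictionary keys and function name are made up for illustration.

```python
# Latencies in nanoseconds, copied from the table above.
LATENCY_NS = {
    "main_memory_reference": 100,
    "nand_flash_read": 50_000,      # the speaker's own estimate
    "disk_seek": 10_000_000,
}


def serial_ops_per_second(latency_ns):
    """Operations per second when each one waits for the previous to finish."""
    return 1e9 / latency_ns


flash_vs_disk = LATENCY_NS["disk_seek"] / LATENCY_NS["nand_flash_read"]
memory_vs_flash = LATENCY_NS["nand_flash_read"] / LATENCY_NS["main_memory_reference"]

print(f"flash is ~{flash_vs_disk:.0f}x faster than a disk seek")
print(f"memory is ~{memory_vs_flash:.0f}x faster than flash")
for name, ns in LATENCY_NS.items():
    print(f"{name}: {serial_ops_per_second(ns):,.0f} serial ops/sec")
```

Flash sits roughly two orders of magnitude from both neighbours: far faster than a seek, still far slower than memory.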
  7. Physics Behind ★ “Floating Gate Transistors” ✦ Non-volatile memory. ★ Two states (one bit) per cell - Single Level Cell (SLC) ✦ Faster, more reliable, expensive. ★ Many states - Multi Level Cell (MLC) ✦ Usually 4 states (two bits per cell). ✦ Slower, less reliable, cheaper.
  8. Classification ★ NOR ✦ Speeds like memory for reads. ✦ Much, much slower for erasing/writing data. ✦ Practical use: storing firmware. ★ NAND ✦ Faster writes. ✦ Only block-level read access (4K). ✦ The idea is to pack as many cells as possible into limited space, to make it competitive with hard drives.
  9. Erasing (NAND) ★ Erase sets all bits to “1111...” ✦ The erasing process is similar to the flash of a camera; this is where the name “flash” comes from. ✦ Erase is slow, done in batch operations (up to 1MB). ★ Changing “1” -> “0” is fast. ★ Changing “0” -> “1” is possible only by erase. ✦ 1st write: “1111” -> “1110”. Block marked as “written”. ✦ 2nd write: even “1110” -> “1010” is not possible.
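
The write-once-then-erase rule on this slide can be modelled in a few lines. This is a toy sketch only: the `NandBlock` class, its 4-bit width and its method names are invented for illustration and do not describe any real device or driver.

```python
class NandBlock:
    """Toy model of one block: erased means all ones, and it is write-once."""

    def __init__(self, bits=4):
        self.all_ones = (1 << bits) - 1
        self.value = self.all_ones      # erased state: "1111"
        self.written = False

    def erase(self):
        """The slow operation: the only way to get zeros back to ones."""
        self.value = self.all_ones
        self.written = False

    def program(self, new_value):
        """Write once; returns False when the block must be erased first."""
        if self.written:
            return False
        self.value = new_value & self.all_ones
        self.written = True
        return True


block = NandBlock()
first = block.program(0b1110)    # "1111" -> "1110": accepted
second = block.program(0b1010)   # block already marked written: refused
block.erase()
third = block.program(0b1010)    # accepted after the erase
print(first, second, third, format(block.value, "04b"))
```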
  10. Erase Challenges ★ Erase is slow. ✦ You want to erase many blocks in a single “flash”. ✦ Block management. ★ [via software] When you write, the card never writes to the same block. ★ A background process runs garbage collection.
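
A minimal sketch of the block management described above, assuming a simple page-mapping scheme: every write lands on a fresh page, the previous copy becomes stale, and garbage collection erases the block holding the fewest live pages. `TinyFTL` and all of its names are hypothetical; real firmware is far more involved and runs collection in the background rather than on demand.

```python
class TinyFTL:
    """Toy flash translation layer: writes never overwrite a page in place."""

    def __init__(self, blocks=4, pages_per_block=4):
        self.pages_per_block = pages_per_block
        # physical[b][p] holds (logical_page, data), or None when unwritten
        self.physical = [[None] * pages_per_block for _ in range(blocks)]
        self.mapping = {}                   # logical page -> (block, page)
        self.erase_count = [0] * blocks

    def _free_slot(self):
        for b, block in enumerate(self.physical):
            for p, slot in enumerate(block):
                if slot is None:
                    return b, p
        return None

    def _live_pages(self, b):
        return [slot for p, slot in enumerate(self.physical[b])
                if slot is not None and self.mapping.get(slot[0]) == (b, p)]

    def garbage_collect(self):
        """Erase the block with the fewest live pages and put them back."""
        victim = min(range(len(self.physical)),
                     key=lambda b: len(self._live_pages(b)))
        survivors = self._live_pages(victim)
        self.physical[victim] = [None] * self.pages_per_block   # slow erase
        self.erase_count[victim] += 1
        for p, (logical_page, data) in enumerate(survivors):
            self.physical[victim][p] = (logical_page, data)
            self.mapping[logical_page] = (victim, p)

    def write(self, logical_page, data):
        slot = self._free_slot()
        if slot is None:
            self.garbage_collect()
            slot = self._free_slot()
            if slot is None:
                raise RuntimeError("device full: no stale pages to reclaim")
        b, p = slot
        self.physical[b][p] = (logical_page, data)
        self.mapping[logical_page] = (b, p)    # the old copy is now stale

    def read(self, logical_page):
        b, p = self.mapping[logical_page]
        return self.physical[b][p][1]


ftl = TinyFTL(blocks=2, pages_per_block=2)
for version in range(5):            # rewrite the same logical page 5 times
    ftl.write(0, f"v{version}")
print(ftl.read(0), ftl.erase_count)
```

In this model a device full of live data leaves nothing to reclaim, which is one way to picture why a fuller device is harder to garbage collect.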
  11. Erase Lifecycle ★ SLC: ~100K erase cycles per cell (may vary). ★ MLC: ~10K erase cycles per cell (may vary). ★ For many this is a major point of discussion. ✦ How big an issue it is depends a lot on firmware. ✦ Many cells and even distribution (“wear levelling”) give a lifetime of a couple of years under heavy workload.
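
A back-of-the-envelope lifetime estimate follows from the cycle counts above. The sketch assumes perfect wear levelling; the daily write volume and the write amplification factor are hypothetical inputs chosen for illustration, not figures from the talk.

```python
def lifetime_years(capacity_gb, erase_cycles, writes_gb_per_day,
                   write_amplification=1.0):
    """Years until every cell reaches its cycle limit, with ideal wear levelling."""
    total_writable_gb = capacity_gb * erase_cycles / write_amplification
    return total_writable_gb / writes_gb_per_day / 365


# Hypothetical heavy workload: 2,000 GB written per day, amplified 2x by
# garbage collection, on cards sized like the ones discussed in this deck.
mlc_years = lifetime_years(320, 10_000, 2_000, write_amplification=2.0)
slc_years = lifetime_years(160, 100_000, 2_000, write_amplification=2.0)
print(f"MLC 320G: ~{mlc_years:.1f} years, SLC 160G: ~{slc_years:.1f} years")
```

With these assumed inputs the MLC card lands at a couple of years, in line with the slide's rough claim.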
  12. Write degradation ★ Expected. ✦ The fuller the device, the harder it is to garbage collect. ★ Graph for Fusion-io 320G MLC card (graph not included in this transcript).
  13. Firmware Really Matters (1) ★ I would expect even less flat performance on cheaper, non-enterprise-class hardware. ✦ Come to my talk on Friday. ✦ I will tell you consistency of performance is more important than anything else.
  14. Firmware Really Matters (2) ★ Many revisions of firmware for each vendor. ✦ Important to compare apples-to-apples in any comparisons. ✦ I heard a rumour that one large SSD vendor is on their 4th successful complete ground-up implementation ;)
  15. Agenda ★ Introduction. ★ A look at the current market. ★ Applications.
  16. The current market (1) ★ Fusion-io. ✦ Established player with a large product line. ✦ Enjoyed a near-monopoly for a while as the only PCI card vendor. ★ Virident. ✦ Previously a MySQL appliance vendor. ✦ Switched business model in ~2010 to just ship PCI flash cards. ✦ Very good, consistent results.
  17. The current market (2) ★ Intel/OCZ/other. ✦ Typically aim for the pro-desktop market. ✦ Do not necessarily offer the same features/promises as the “enterprise hardware”...
  18. You pay more for... ★ Greater amount of over-provisioning (more consistent). ★ Internal redundancy (aka RAID). ★ More complex firmware (more consistent). ★ Guarantee of durability (such as a capacitor). ★ Greater life-span (more write cycles). ★ Better performance (many more IOPS).
  19. Fusion-io
  20. Performance Specification ★ 160G SLC ✦ 110K read IOPS (4K). ✦ 26us read latency. ★ 320G MLC ✦ 71K read IOPS. ✦ 41us read latency. ★ “Duo” range (not covered). ★ Lifetime: ✦ SLC flash @ 40% write duty: 25 calendar years. ✦ MLC flash @ 20% write duty: 10 calendar years. ✦ MLC flash @ 40% write duty: 5 calendar years.
  21. Fusion-io Overview ★ Fast. Very fast. ✦ Cheaper than disks in terms of $ per IOPS. ★ PCI-E - closest to CPU. ★ Durability. ★ Shares host memory / CPU. ★ Most complex part - firmware. ★ Large amount of space reservation for heavy writes.
  22. Fusion-io drawbacks ★ Expensive. Let’s say “$6000+” (retail; your price may be less). ✦ For full performance, requires an additional 25% space reservation. ✦ DRAM is actually probably cheaper per GB. ★ PCI-E is not hot swap. ✦ Also has potential for errors (when the host fails, garbage keeps being sent; Fusion-io handles this well).
  23. Fusion-io durability ★ Cache is located on host system. ★ “Transaction log” to prevent lost data. ✦ Crash recovery.
  24. Fusion-io read performance ★ 160GB SLC card, 8 threads: 33K IOPS (525MB/sec), 0.28 ms 95% response time. ★ The RAID 10 comparison is a Dell PERC 6/i with 8 x 2.5” 15K RPM SAS disks.
  25. Fusion-io write performance ★ 8 threads: 20K IOPS (314MB/sec), 0.26 ms 95% response time.
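
The IOPS and MB/sec figures on the two benchmark slides can be cross-checked by dividing one by the other. The result is close to 16 KB per request, which matches InnoDB's default page size; that the benchmark actually issued 16 KB requests is an inference from the arithmetic, not something the slides state. The function name is made up for illustration.

```python
def implied_io_size_kb(mb_per_sec, iops):
    """Average request size implied by a throughput / IOPS pair."""
    return mb_per_sec * 1024 / iops


read_kb = implied_io_size_kb(525, 33_000)    # read slide: 33K IOPS, 525MB/sec
write_kb = implied_io_size_kb(314, 20_000)   # write slide: 20K IOPS, 314MB/sec
print(f"reads: ~{read_kb:.1f} KB per IO, writes: ~{write_kb:.1f} KB per IO")
```

This also explains why 33K IOPS here sits well below the 110K read IOPS in the specification, which is quoted for 4K requests.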
  26. Fusion-io databases ★ Many read / write threads are needed to utilize the throughput. ★ “MySQL” is not able to fully use it. ✦ Better in 5.5, MySQL-5.1-plugin, XtraDB. ★ InnoDB IO path “needs work”.
  27. Virident TachIOn
  28. Virident ★ PCI interface. ★ Has NAND flash upgrade modules. ★ Good stable results. ★ Advertised 300,000 IOPS in 75:25 (read:write).
  29. Virident Options ★ 300G, 400G, 600G, 800G SLC cards. ✦ 400G is $13,600. ★ (More or less the same price range as Fusion-io.)
  30. 2010 Benchmarks: http://www.mysqlperformanceblog.com/2010/06/15/virident-tachion-new-player-on-flash-pci-e-cards-market/
  31. Intel SSDs
  32. Intel SSDs ★ Were awesome in 2008. ✦ Many accolades; probably the first SSDs that made sense for a lot of pro-desktop users. ★ A couple of iterations of firmware, but mostly Intel treated customers like mushrooms for 2 years. ✦ No clear advance warning of the road map. ✦ Finally, a replacement 510 series was announced last month. • These slides don’t feature it; I have not used it.
  33. Intel Overview ★ SATA form factor. ★ Intel X25-M Gen I (50nm) & Gen II (34nm). ✦ MLC. ★ Intel X25-E (50nm). ✦ SLC. ✦ “Enterprise”. ★ New 510 series - just released last month.
  34. X25-E ★ 32G / 64G. ★ Throughput: 35K IOPS reads, 3.5K IOPS writes. ★ Latency: 75us reads, 85us writes. ★ 64G - $725. ✦ $11/GB. ★ Write endurance: ✦ 1 petabyte of random writes (32G). ✦ 2 petabytes of random writes (64G).
  35. X25-M Gen II ★ 80G / 160G. ★ Throughput: 35K IOPS reads, 6.5K / 8.5K IOPS writes. ★ Latency: 65us reads, 85us writes. ★ 160G - $415. ✦ ~$3/GB. ★ Write endurance: ✦ Not mentioned in the official specification.
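
The price and endurance figures on the two X25 slides reduce to simple arithmetic. The sketch below recomputes the cost per GB and converts the quoted X25-E endurance into full-drive overwrites; it assumes decimal units (1 PB = 1,000,000 GB), and the names are invented for illustration.

```python
DRIVES = {
    "X25-E 64G": {"capacity_gb": 64, "price_usd": 725},
    "X25-M Gen II 160G": {"capacity_gb": 160, "price_usd": 415},
}


def usd_per_gb(drive):
    return drive["price_usd"] / drive["capacity_gb"]


def full_drive_writes(endurance_pb, capacity_gb):
    """How many times the whole drive can be overwritten (1 PB = 1,000,000 GB)."""
    return endurance_pb * 1_000_000 / capacity_gb


for name, drive in DRIVES.items():
    print(f"{name}: ${usd_per_gb(drive):.2f}/GB")

x25e_32 = full_drive_writes(1, 32)
x25e_64 = full_drive_writes(2, 64)
print(f"X25-E 32G: {x25e_32:,.0f} drive writes, 64G: {x25e_64:,.0f}")
```

Both X25-E capacities work out to the same number of overwrites, so the quoted endurance scales linearly with capacity.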
  36. X25-E and X25-M ★ Even if “E” is enterprise - power loss means data loss. ✦ Loss of transactions. ★ You can disable the write cache, but performance is woeful.
  37. X25 Deployments ★ RAID ✦ Software / hardware? ✦ Level 0? 1? 10? 5? 50? ★ Engineering process could be complicated and expensive. ✦ There are/were ready-made solutions (Schooner[1], Gear6[2], Cisco servers). [1] Changed business model recently. [2] Went broke.
  38. Agenda ★ Introduction. ★ A look at the current market. ★ Applications.
  39. MySQL Specific (1) ★ SSD is very good at random reads. ✦ Not so good at sequential writes! ★ Data files on SSD. ✦ Table files (*.ibd). ✦ Rollback segments (ibdata1). ★ Logs on RAID with BBU. ✦ Binary logs. ✦ Transaction logs. ✦ Double write buffer. ✦ Insert buffer. ✦ Slow log, error log, general log. See: http://yoshinorimatsunobu.blogspot.com/2009/05/tables-on-ssd-redobinlogsystem.html
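
One possible way to express the split above is a `[mysqld]` section that points the data directory at the SSD and the sequentially written logs at the RAID volume. The sketch generates such a fragment; the two mount points are hypothetical, and the option names are standard MySQL server options. The doublewrite and insert buffers are not covered: in stock MySQL of this era they are stored inside the system tablespace (ibdata1) rather than as separately placeable files, so splitting them out needs something beyond these options.

```python
import configparser
import io

# Hypothetical mount points; adjust to the actual layout.
SSD = "/mnt/ssd/mysql"
RAID = "/mnt/raid/mysql"

config = configparser.ConfigParser()
config["mysqld"] = {
    "datadir": SSD,                              # *.ibd table files, ibdata1
    "innodb_log_group_home_dir": RAID,           # InnoDB transaction logs
    "log_bin": f"{RAID}/binlog",                 # binary logs
    "slow_query_log_file": f"{RAID}/slow.log",
    "general_log_file": f"{RAID}/general.log",
    "log_error": f"{RAID}/error.log",
}

buffer = io.StringIO()
config.write(buffer)
my_cnf = buffer.getvalue()
print(my_cnf)
```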
  40. MySQL Specific (2) ★ Buy memory, or buy SSDs? ✦ [Usually] Buy memory when it’s possible.
  41. Other Reasons to use Flash (1) ★ Server consolidation. ✦ Hard drives do ~100-200 IOPS.* ✦ Now one card can get 100K (theoretical)! ✦ ~2x - 10x reduction in many cases (see Craigslist). * Assuming no RAID controller performing additional merging.
  42. Other Reasons to use Flash (2) ★ Power consumption reduction. ✦ “Transactions per watt” is dramatically better. • See: http://www.percona.com/files/percona-live/jeremy-Craigslist.pptx.pdf ✦ Important for a large number of people. Even if power is cheap, colo facilities often limit the power available per rack.
  43. Other Reasons to use Flash (3) ★ Limit variance / risk of operational issues from cold starts. ✦ Easy to see something like an advertising network miss response time goals when the aim is 50ms/page. • Each IO is ~10ms. • Follow a few secondary keys to a primary key and you miss it. ★ Good for throughput too.
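
The response-time argument above is a budget calculation: how many dependent IOs fit into one page's time budget when the cache is cold. The sketch uses the 50ms goal and ~10ms disk IO from the slide and the ~50us NAND estimate from the latency table; the six-IO page is a hypothetical example.

```python
def serial_ios_within(budget_us, io_latency_us):
    """How many one-after-another IOs fit into the time budget."""
    return budget_us // io_latency_us


PAGE_BUDGET_US = 50_000    # 50ms per page
DISK_IO_US = 10_000        # ~10ms per IO
FLASH_IO_US = 50           # NAND estimate from the latency table

disk_ios = serial_ios_within(PAGE_BUDGET_US, DISK_IO_US)
flash_ios = serial_ios_within(PAGE_BUDGET_US, FLASH_IO_US)
print(f"disk: {disk_ios} IOs per page, flash: {flash_ios} IOs per page")

# A hypothetical cold page that needs 6 dependent lookups:
lookups = 6
print("disk meets goal:", lookups * DISK_IO_US <= PAGE_BUDGET_US)
print("flash meets goal:", lookups * FLASH_IO_US <= PAGE_BUDGET_US)
```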
  44. Applications must change
  45. Short Term (1) ★ Multi-threaded IO is required to exploit all the throughput offered. ✦ InnoDB Plugin, MySQL 5.5 ready. ✦ Many other databases are not ready.
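
To illustrate what multi-threaded IO means in practice, here is a minimal sketch that keeps several random reads in flight at once using a thread pool and `os.pread` (available on Unix-like systems). The 16 KB request size, the function names and the temporary test file are assumptions for the example; a real benchmark would also need to bypass the OS page cache, which this sketch does not attempt.

```python
import os
import random
import tempfile
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 16 * 1024    # assumed request size


def read_page(fd, page_no):
    # os.pread takes an explicit offset, so threads can share one descriptor.
    return os.pread(fd, PAGE_SIZE, page_no * PAGE_SIZE)


def parallel_random_reads(path, page_count, reads, threads):
    """Issue `reads` random page reads with up to `threads` in flight."""
    fd = os.open(path, os.O_RDONLY)
    try:
        pages = [random.randrange(page_count) for _ in range(reads)]
        with ThreadPoolExecutor(max_workers=threads) as pool:
            return list(pool.map(lambda n: read_page(fd, n), pages))
    finally:
        os.close(fd)


# Small self-contained demo against a temporary 64-page file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(PAGE_SIZE * 64))
results = parallel_random_reads(tmp.name, page_count=64, reads=256, threads=8)
os.unlink(tmp.name)
print(len(results), "reads completed")
```

With a single thread, a device that needs many outstanding requests to reach its rated IOPS spends most of its time idle; raising `threads` is the application-side change the slide is asking for.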
  46. Short Term (2) ★ Opportunities for multi-level caches when data exceeds the SSD’s size. ✦ See Flashcache (Facebook), ZFS L2ARC, Veritas.
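
The multi-level idea can be sketched as a small RAM cache whose evicted pages are demoted to a larger flash tier instead of being dropped, so a later miss in RAM costs a flash read rather than a disk read. This is a read-only toy with LRU eviction at both levels; `TwoLevelCache` and its names are hypothetical and are not how Flashcache or L2ARC are actually implemented.

```python
from collections import OrderedDict


class TwoLevelCache:
    """RAM LRU in front of a flash LRU in front of a slow backing store."""

    def __init__(self, ram_pages, flash_pages, backing_store):
        self.ram = OrderedDict()
        self.flash = OrderedDict()
        self.ram_pages = ram_pages
        self.flash_pages = flash_pages
        self.backing_store = backing_store    # dict standing in for disk
        self.disk_reads = 0

    def get(self, key):
        if key in self.ram:
            self.ram.move_to_end(key)          # refresh LRU position
            return self.ram[key]
        if key in self.flash:
            value = self.flash.pop(key)        # promote from flash
        else:
            value = self.backing_store[key]    # slowest path
            self.disk_reads += 1
        self._put_ram(key, value)
        return value

    def _put_ram(self, key, value):
        self.ram[key] = value
        if len(self.ram) > self.ram_pages:
            old_key, old_value = self.ram.popitem(last=False)
            self.flash[old_key] = old_value    # demote instead of dropping
            if len(self.flash) > self.flash_pages:
                self.flash.popitem(last=False)


disk = {i: f"page{i}" for i in range(10)}
cache = TwoLevelCache(ram_pages=2, flash_pages=4, backing_store=disk)
for key in [0, 1, 2, 3, 0, 1]:
    cache.get(key)
print("disk reads:", cache.disk_reads)
```

In the demo the second reads of pages 0 and 1 are served from the flash tier, so only the four first-time reads reach the backing store.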
  47. Long Term ★ Decades of hard drive assumptions about the cost of random IO need to be unwound. ✦ For example, InnoDB, Oracle, PostgreSQL work like this...
  48. Basic Operation (High Level) [Diagram, built up over slides 48-53: Log Files, Tablespace, Buffer Pool] SELECT * FROM City WHERE CountryCode='AUS'
  54. Basic Operation (cont.) [Diagram, built up over slides 54-61: Log Files (shown receiving “01010” from slide 58 onward), Tablespace, Buffer Pool] UPDATE City SET name = 'Morgansville' WHERE name = 'Brisbane' AND CountryCode = 'AUS'
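
The diagrams are not reproduced in this transcript, so the following is only one reading of them, based on the three labelled components: a read pulls a page from the tablespace into the buffer pool, an update changes the page in memory and appends to the log, and the tablespace is brought up to date later. `TinyEngine` and every name in it are invented for illustration; this is not InnoDB's actual code path.

```python
class TinyEngine:
    """Toy buffer pool + sequential log + tablespace."""

    def __init__(self, tablespace):
        self.tablespace = tablespace    # dict standing in for on-disk pages
        self.buffer_pool = {}
        self.dirty = set()
        self.log = []                   # appended to sequentially

    def read(self, page_id):
        if page_id not in self.buffer_pool:
            # A miss costs one random read from the tablespace.
            self.buffer_pool[page_id] = self.tablespace[page_id]
        return self.buffer_pool[page_id]

    def update(self, page_id, new_value):
        self.read(page_id)                       # page must be in memory
        self.log.append((page_id, new_value))    # cheap sequential write
        self.buffer_pool[page_id] = new_value
        self.dirty.add(page_id)

    def checkpoint(self):
        # Deferred random writes back to the tablespace.
        for page_id in sorted(self.dirty):
            self.tablespace[page_id] = self.buffer_pool[page_id]
        self.dirty.clear()


tablespace = {1: "Brisbane"}
engine = TinyEngine(tablespace)
engine.update(1, "Morgansville")
on_disk_before = tablespace[1]      # still the old value
engine.checkpoint()
on_disk_after = tablespace[1]
print(on_disk_before, "->", on_disk_after, "| log:", engine.log)
```

The design trades random writes for sequential ones because random IO is expensive on spinning disks, which is the assumption the talk argues flash calls into question.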
  62. Long Term (cont.) ★ Examples of “the database is the log” for MySQL are the PBXT and RethinkDB storage engines.
  63. Storage Hardware also changes ★ Most of us are used to buying RAID controllers and placing disks behind them. ✦ Only a very limited number of RAID controllers understand SSDs. ✦ RAID controllers are used to optimizing IO for devices capable of 100-200 IOPS. ✦ If we look at Fusion-io, the devices also use RAID internally (~RAID4).
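
For readers unfamiliar with RAID4, the core idea is a dedicated parity chunk computed as the XOR of the data chunks, which allows any single lost chunk to be rebuilt. The sketch below shows only that generic principle with made-up data; it says nothing about how Fusion-io implements its internal redundancy.

```python
from functools import reduce


def xor_parity(chunks):
    """Byte-wise XOR across equally sized chunks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def rebuild(surviving_chunks, parity):
    """Recover the single missing chunk from the survivors and the parity."""
    return xor_parity(surviving_chunks + [parity])


data = [b"flash", b"cards", b"mysql"]    # three equally sized data chunks
parity = xor_parity(data)

# Pretend the chunk holding data[1] failed, then rebuild it.
recovered = rebuild([data[0], data[2]], parity)
print(recovered)
```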
  64. Technologies to look at ★ More PCI Express cards. ✦ Potential to lower the barrier to entry: only ~2-3 players, so competition is not as hot as it could be (yet). ★ More enterprise-focused MLC. ✦ Better software (firmware) means more wear levelling, improved performance, etc. ✦ More storage in fewer cells = lower cost. ★ Violin Memory ✦ I am not hands-on familiar with their technology, but they have some very high end offerings. ✦ Expect more awesome high end offerings (all vendors).
  65. Questions ★ Thank you to Confoo for letting me speak about such a niche topic! ★ If I’m out of time, please feel free to catch me around.