An Overview of Flash Storage for Databases

Published in Technology
Transcript

  • 1. An Overview of Flash Storage for Databases. Morgan Tocker <morgan@percona.com>. Wednesday, March 9, 2011
  • 2. Introduction ★ [Me] Director of Training. Previously worked at MySQL, Sun Microsystems. ★ [Percona] Consulting, Training, Support & Development for MySQL. ★ No vested interest in which hardware I recommend. ✦ [Disclaimer] Some hardware vendors have engaged our services to evaluate and improve performance of their products.
  • 3. What this talk is about ★ Flash technologies (NAND, NOR). ★ Server Usage. ✦ Not USB thumb drives. ✦ Not Consumer usage. ★ “For Database” == MySQL. ✦ Should be more or less applicable for all databases.
  • 4. Agenda ★ Introduction. ★ A look at the current market. ★ Applications.
  • 5. Revolutionary ★ Change in technology - ✦ From spinning disk to solid state. ★ No mechanical moving parts. ★ Jump in performance. ★ Requires changes in the Application. ★ Hard not to predict SSDs quickly replacing all hard disks in the next 5-10 years.* ✦ * However, at the moment hard disks are still becoming cheaper (per unit of capacity) more quickly than SSDs!
  • 6. “Numbers everyone should know”
      L1 cache reference: 0.5 ns
      Branch mispredict: 5 ns
      L2 cache reference: 7 ns
      Mutex lock/unlock: 25 ns
      Main memory reference: 100 ns
      Compress 1K bytes with Zippy: 3,000 ns
      Send 2K bytes over 1 Gbps network: 20,000 ns
      NAND Flash (my estimate): 50,000 ns
      Read 1 MB sequentially from memory: 250,000 ns
      Round trip within same datacenter: 500,000 ns
      Disk seek: 10,000,000 ns
      Read 1 MB sequentially from disk: 20,000,000 ns
      Send packet CA->Netherlands->CA: 150,000,000 ns
      See: http://www.linux-mag.com/cache/7589/1.html and Google http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
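To put the table above in perspective, here is a small sketch that turns three of the figures into ratios. The dictionary keys and function names are made up for illustration; the numbers are copied from the slide, and the NAND figure is the speaker's own estimate.

```python
# Latencies in nanoseconds, copied from the slide above.
LATENCY_NS = {
    "main_memory_reference": 100,
    "nand_flash_read": 50_000,   # the speaker's estimate, not a vendor figure
    "disk_seek": 10_000_000,
}


def ratio(slower: str, faster: str) -> float:
    """How many times slower one operation is than another."""
    return LATENCY_NS[slower] / LATENCY_NS[faster]


def max_serial_ops_per_second(name: str) -> float:
    """Upper bound on operations per second if they are issued one at a time."""
    return 1e9 / LATENCY_NS[name]


print(ratio("disk_seek", "nand_flash_read"))              # 200.0
print(ratio("nand_flash_read", "main_memory_reference"))  # 500.0
print(max_serial_ops_per_second("disk_seek"))             # 100.0
```

On these numbers, flash sits about 200x closer to memory than a disk seek does, but is still about 500x slower than a main memory reference.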
  • 7. Physics Behind ★ “Floating Gate Transistors” ✦ Non volatile memory. ★ One bit per cell (two states) - Single Level Cell (SLC) ✦ Faster, more reliable, expensive. ★ Multiple bits per cell - Multi Level Cell (MLC) ✦ Usually 4 states (2 bits). ✦ Slower, less reliable, cheaper.
  • 8. Classification ★ NOR ✦ Speeds like memory for reads. ✦ Much, much slower for erase/writing data. ✦ Practical use: storing firmware. ★ NAND ✦ Faster writes. ✦ Only block-level read access (4K). ✦ Idea is to pack as many cells as possible into limited space - to make it competitive with hard drives.
  • 9. Erasing (NAND) ★ Erase is to set all bits to “1111...” ✦ Erasing process is similar to the “flash” in cameras - this is where the name FLASH comes from. ✦ Erase is slow, done in batch operations (up to 1MB). ★ Change “1” -> “0” is fast. ★ Change “0” -> “1” is possible only by erase. ✦ 1st write: “1111” -> “1110”. Block marked as “written”. ✦ 2nd write: even “1110” -> “1010” is not possible.
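The write/erase rules on the slide above can be modelled in a few lines. This is a toy sketch, not how any real controller works: the class name `NandBlock`, the 4-bit pages, and the method names are all assumptions made for illustration.

```python
class NandBlock:
    """Toy model of one NAND erase block made of 4-bit pages."""

    ERASED = 0b1111

    def __init__(self, pages: int = 4):
        self.pages = [self.ERASED] * pages
        self.written = [False] * pages
        self.erase_count = 0

    def program(self, page: int, value: int) -> None:
        """Write a page once. Programming can only flip bits from 1 to 0."""
        if self.written[page]:
            # Matches the slide: a page marked "written" is not reprogrammed,
            # even if the change would only clear more bits.
            raise ValueError("page already written; erase the block first")
        self.pages[page] &= value
        self.written[page] = True

    def erase(self) -> None:
        """Slow batch operation: reset every page in the block to all 1s."""
        self.pages = [self.ERASED] * len(self.pages)
        self.written = [False] * len(self.pages)
        self.erase_count += 1


block = NandBlock()
block.program(0, 0b1110)          # 1st write: 1111 -> 1110
try:
    block.program(0, 0b1010)      # 2nd write to the same page is refused
except ValueError as err:
    print(err)
block.erase()                     # only an erase brings the 1s back
```

The `erase_count` field is there because each erase wears the cells, which is what the lifecycle figures later in the talk are counting.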
  • 10. Erase Challenges ★ Erase is slow ✦ You want to erase many blocks in a single “flash”. ✦ Block Management. ★ [via software] When you write, the card never rewrites the same block in place. ★ Background process to run garbage collection.
  • 11. Erase Lifecycle ★ SLC ~100K times per cell (may vary). ★ MLC ~10K times per cell (may vary). ★ For many this is a major point of discussion. ✦ How big of an issue depends a lot on firmware. ✦ Many cells and even distribution (“wear levelling”) mean a lifetime of a couple of years under heavy workload.
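A back-of-the-envelope sketch of how erase cycles translate into lifetime. It assumes perfectly even wear levelling and a fixed write amplification factor; the function name, the 2 TB/day write rate, and the amplification factor of 2.0 are hypothetical inputs, not figures from the talk.

```python
def estimated_lifetime_days(capacity_gb: float,
                            erase_cycles_per_cell: float,
                            host_writes_gb_per_day: float,
                            write_amplification: float = 1.0) -> float:
    """Days until every cell has used up its erase cycles.

    Assumes ideal wear levelling, so total writable volume is
    capacity * cycles, reduced by the write amplification factor.
    """
    total_writable_gb = capacity_gb * erase_cycles_per_cell
    flash_writes_per_day = host_writes_gb_per_day * write_amplification
    return total_writable_gb / flash_writes_per_day


# Hypothetical workload: a 320 GB MLC device (~10K cycles per cell, from
# the slide) absorbing 2 TB of host writes per day at 2x amplification.
days = estimated_lifetime_days(320, 10_000, 2_000, write_amplification=2.0)
print(days, days / 365)
```

With these made-up inputs the result is 800 days, a little over two years, which is in line with "a couple of years under heavy workload". Real devices deviate from this depending on firmware and over-provisioning.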
  • 12. Write degradation ★ Expected. ✦ The fuller the device, the harder it is to garbage collect. ★ Graph for Fusion-io 320G MLC card: [graph not reproduced in this transcript]
  • 13. Firmware Really Matters (1) ★ I would expect even less flat performance on cheaper, non-enterprise-class hardware. ✦ Come to my talk on Friday. ✦ I will tell you consistency of performance is more important than anything else.
  • 14. Firmware Really Matters (2) ★ Many revisions of firmware for each vendor. ✦ Important to compare apples-to-apples in any comparisons. ✦ I heard a rumour one large SSD vendor is on their 4th successful complete ground up implementation ;)
  • 15. Agenda ★ Introduction. ★ A look at the current market. ★ Applications.
  • 16. The current market (1) ★ Fusion-io. ✦ Established player with a large product line. ✦ Enjoyed a near-monopoly for a while as the only PCI card vendor. ★ Virident. ✦ Previously a MySQL Appliance vendor. ✦ Switched business model in ~2010 to just ship PCI Flash cards. ✦ Very good, consistent results.
  • 17. The current market (2) ★ Intel/OCZ/other. ✦ Typically aims for pro-desktop market. ✦ Does not necessarily offer the same features/promises as the “enterprise hardware”...
  • 18. You pay more for... ★ Greater amount of over-provisioning (more consistent). ★ Internal redundancy (aka RAID). ★ More complex firmware (more consistent). ★ Guarantee of durability (such as a capacitor). ★ Greater life-span (more write cycles). ★ Better Performance (much more IOPS).
  • 19. Fusion-io
  • 20. Performance Specification ★ 160G SLC ✦ 110K read IOPS (4K) ✦ 26us read latency. ★ 320G MLC ✦ 71K read IOPS. ✦ 41us read latency. ★ “Duo” Range (not covered). ★ Lifetime: ✦ SLC flash @ 40% write duty | 25 calendar years ✦ MLC flash @ 20% write duty | 10 calendar years ✦ MLC flash @ 40% write duty | 5 calendar years
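The IOPS and latency figures above are linked by Little's law (requests in flight = throughput x latency). The sketch below applies it to the quoted numbers; the function names are illustrative, and it assumes the quoted latency holds at the quoted throughput, which a real device may not guarantee.

```python
def required_concurrency(iops: float, latency_us: float) -> float:
    """Average number of requests that must be in flight (Little's law)."""
    return iops * latency_us / 1_000_000


def single_thread_iops(latency_us: float) -> float:
    """Best case for one thread issuing one synchronous read at a time."""
    return 1_000_000 / latency_us


print(required_concurrency(110_000, 26))  # 160G SLC: just under 3 in flight
print(required_concurrency(71_000, 41))   # 320G MLC: just under 3 in flight
print(single_thread_iops(26))             # about 38K IOPS from a single thread
```

Under these assumptions a single synchronous reader cannot reach the advertised read IOPS on either card; several requests have to be outstanding at once.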
  • 21. Fusion-io Overview ★ Fast. Very fast. ✦ Cheaper than disks in terms of $-per IOPS. ★ PCI-E - closest to CPU. ★ Durability. ★ Shares host memory / CPU ★ Most complex part - firmware. ★ Large amount of space reservation for heavy writes.
  • 22. Fusion-io drawbacks ★ Expensive. Let’s say “$6000+” (retail; your price may be less). ✦ For full performance, requires additional 25% space reservation. ✦ DRAM is actually probably cheaper per GB. ★ PCI-E is not hot swap. ✦ Also has potential for errors (when host fails, garbage keeps being sent. Fusion-io handles this well.)
  • 23. Fusion-io durability ★ Cache is located on host system. ★ “Transaction log” to prevent lost data. ✦ Crash recovery.
  • 24. Fusion-io read performance ★ 160GB SLC card, 8 threads: 33K IOPS (525MB/sec), 0.28 ms 95% response time. ★ The RAID 10 comparison is a Dell PERC 6i with 8 x 2.5” 15K RPM SAS disks.
  • 25. Fusion-io write performance ★ 8 threads: 20K IOPS (314MB/sec), 0.26 ms 95% response time.
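Dividing the throughput by the IOPS in the two benchmark slides above gives the average request size. The sketch assumes "MB" means 10^6 bytes, which the slides do not specify; the function name is illustrative.

```python
def bytes_per_io(mb_per_sec: float, iops: float,
                 bytes_per_mb: int = 1_000_000) -> float:
    """Average request size implied by a throughput and an IOPS figure."""
    return mb_per_sec * bytes_per_mb / iops


print(bytes_per_io(525, 33_000))  # read test: roughly 15.9 KB per request
print(bytes_per_io(314, 20_000))  # write test: 15.7 KB per request
```

Both land close to 16 KB, InnoDB's default page size, which suggests the benchmark used page-sized requests rather than the 4K requests behind the vendor's IOPS specification. That is an inference from the arithmetic, not something the slides state.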
  • 26. Fusion-io databases ★ Many read / write threads to utilize throughput. ★ “MySQL” is not able to fully use it. ✦ Better in 5.5, MySQL 5.1 with the InnoDB Plugin, XtraDB. ★ InnoDB IO path “needs work”.
  • 27. Virident TachIOn
  • 28. Virident ★ PCI interface. ★ Has NAND flash upgrade modules. ★ Good stable results. ★ Advertised 300,000 IOPS in 75:25 (read:write).
  • 29. Virident Options ★ 300G, 400G, 600G, 800G SLC cards. ✦ 400G is $13,600 ★ (More or less the same price range as Fusion-io).
  • 30. 2010 Benchmarks: http://www.mysqlperformanceblog.com/2010/06/15/virident-tachion-new-player-on-flash-pci-e-cards-market/
  • 31. Intel SSDs
  • 32. Intel SSDs ★ Were awesome in 2008. ✦ Many accolades, probably the first SSDs that made sense for a lot of pro-desktop users. ★ A couple of iterations of firmware, but mostly Intel treated customers like mushrooms for 2 years. ✦ No clear advance warning of road map. ✦ Finally a replacement 510 series announced last month. • Slides don’t feature these. Have not used them.
  • 33. Intel Overview ★ SATA form factor. ★ Intel X25-M Gen I (50nm) & Gen II (34nm). ✦ MLC ★ Intel X25-E (50nm) ✦ SLC ✦ “Enterprise”. ★ New 510 series - just released last month.
  • 34. X25-E ★ 32G / 64G ★ Throughput: 35K IOPS reads, 3.5K IOPS writes. ★ Latency: 75us reads, 85us writes. ★ 64G - $725 ✦ $11/GB ★ Write endurance: ✦ 1 petabyte of random writes (32G) ✦ 2 petabytes of random writes (64G)
  • 35. X25-M Gen II ★ 80G / 160G ★ Throughput: 35K IOPS reads, 6.5K / 8.5K IOPS writes. ★ Latency: 65us reads, 85us writes. ★ 160GB - $415 ✦ ~$3 / GB ★ Write Endurance. ✦ Not mentioned in official specification.
  • 36. X25-E and X25-M ★ Even if “E” is enterprise - power loss means data loss. ✦ Loss of transactions. ★ You can disable write cache, but performance is woeful.
  • 37. X25 Deployments ★ RAID ✦ Software / hardware? ✦ Level 0? 1? 10? 5? 50? ★ Engineering process could be complicated and expensive. ✦ There are/were ready solutions (Schooner[1], Gear6[2], Cisco servers). ✦ [1] Changed business model recently. ✦ [2] Went broke.
  • 38. Agenda ★ Introduction. ★ A look at the current market. ★ Applications.
  • 39. MySQL Specific (1) ★ SSD is very good at Random reads. ✦ Not so good at sequential writes! ★ Data files on SSD. ✦ Table files (*.ibd). ✦ Rollback segments (ibdata1). ★ Logs on RAID with BBU. ✦ Binary logs. ✦ Transaction logs. ✦ Double write buffer. ✦ Insert buffer. ✦ Slow log, error log, general log. ★ See: http://yoshinorimatsunobu.blogspot.com/2009/05/tables-on-ssd-redobinlogsystem.html
  • 40. MySQL Specific (2) ★ Buy memory, or buy SSDs? ✦ [Usually] Buy memory when it’s possible.
  • 41. Other Reasons to use Flash (1) ★ Server Consolidation. ✦ Hard drives do ~100-200 IOPS.* ✦ Now one card can get 100K (theoretical)! ✦ ~2x - 10x reduction in server count in many cases (see craigslist). ✦ * Assuming no RAID controller performing additional merging.
  • 42. Other Reasons to use Flash (2) ★ Power consumption reduction. ✦ Power used per transaction is dramatically lower (far better “transactions per watt”). • See: http://www.percona.com/files/percona-live/jeremy-Craigslist.pptx.pdf ✦ Important for a large number of people. Even if power is cheap, colo facilities often limit the power available per rack.
  • 43. Other Reasons to use Flash (3) ★ Limit variance / risk of operational issues from cold starts. ✦ Easy to see something like an advertising network miss response time goals when the aim is 50ms/page. • Each IO is ~10ms. • Follow a few secondary keys to a primary key and you miss it. ★ Good for throughput too.
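The response-time argument above is simple division. The sketch below works in whole microseconds to avoid floating-point rounding; the 50 µs flash figure is the speaker's estimate from the “Numbers everyone should know” slide, and the function name is illustrative.

```python
def max_serial_ios(budget_us: int, io_latency_us: int) -> int:
    """How many dependent (one-after-another) IOs fit in a latency budget."""
    return budget_us // io_latency_us


PAGE_BUDGET_US = 50_000   # 50 ms per page, from the slide
DISK_IO_US = 10_000       # ~10 ms per uncached disk IO, from the slide
FLASH_IO_US = 50          # speaker's NAND estimate from the earlier slide

print(max_serial_ios(PAGE_BUDGET_US, DISK_IO_US))   # 5
print(max_serial_ios(PAGE_BUDGET_US, FLASH_IO_US))  # 1000
```

With a cold cache on disk, five dependent lookups already consume the whole budget, before any CPU or network time is counted.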
  • 44. Applications must change
  • 45. Short Term (1) ★ Multi-threaded IO is required to exploit all throughput offered. ✦ InnoDB Plugin, MySQL 5.5 ready. ✦ Many other databases are not ready.
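A minimal sketch of what "multi-threaded IO" means in practice: several threads issuing positioned reads against the same file so that more than one request is outstanding at a time. It uses `os.pread`, which is available on Unix-like systems; the helper names and the 4K block size are choices made for this example. Against a small temporary file the reads are served from the OS page cache, so this shows the access pattern only and is not a benchmark.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4096


def make_test_file(num_blocks: int) -> str:
    """Create a temp file in which block i is filled with the byte value i."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for i in range(num_blocks):
            f.write(bytes([i % 256]) * BLOCK_SIZE)
    return path


def read_block(fd: int, block_no: int) -> bytes:
    """Positioned read: no shared file offset, so it is safe across threads."""
    return os.pread(fd, BLOCK_SIZE, block_no * BLOCK_SIZE)


def parallel_random_reads(path: str, block_numbers, threads: int = 8):
    """Read the given blocks using a pool of worker threads."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=threads) as pool:
            return list(pool.map(lambda n: read_block(fd, n), block_numbers))
    finally:
        os.close(fd)


if __name__ == "__main__":
    path = make_test_file(64)
    blocks = parallel_random_reads(path, [42, 7, 63, 0])
    print([b[0] for b in blocks])
    os.remove(path)
```

The eight worker threads mirror the "8 threads" used in the Fusion-io benchmark slides; a real storage engine would more likely use its own IO threads or asynchronous IO.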
  • 46. Short Term (2) ★ Opportunities for Multi-level caches when data exceeds SSDs size. ✦ See Flashcache (Facebook), ZFS L2 ARC, Veritas.
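The multi-level cache idea can be sketched as a small RAM tier that demotes evicted entries to a larger flash tier, with disk behind both. This is an illustration of the general concept only; it is not how Flashcache or ZFS L2ARC are implemented, and the class name `TieredCache` and its plain dictionaries standing in for devices are assumptions.

```python
from collections import OrderedDict


class TieredCache:
    """Toy two-level cache: RAM (LRU) -> flash (LRU) -> backing disk."""

    def __init__(self, ram_capacity: int, flash_capacity: int, backing_store):
        self.ram = OrderedDict()
        self.flash = OrderedDict()
        self.ram_capacity = ram_capacity
        self.flash_capacity = flash_capacity
        self.backing_store = backing_store   # stands in for the slow disk
        self.hits = {"ram": 0, "flash": 0, "disk": 0}

    def get(self, key):
        if key in self.ram:
            self.ram.move_to_end(key)        # mark as most recently used
            self.hits["ram"] += 1
            return self.ram[key]
        if key in self.flash:
            value = self.flash.pop(key)      # promote back into RAM
            self.hits["flash"] += 1
        else:
            value = self.backing_store[key]  # the expensive disk read
            self.hits["disk"] += 1
        self._put_ram(key, value)
        return value

    def _put_ram(self, key, value):
        self.ram[key] = value
        if len(self.ram) > self.ram_capacity:
            old_key, old_value = self.ram.popitem(last=False)
            self.flash[old_key] = old_value  # demote to flash, do not drop
            if len(self.flash) > self.flash_capacity:
                self.flash.popitem(last=False)


disk = {i: f"row{i}" for i in range(10)}
cache = TieredCache(ram_capacity=2, flash_capacity=4, backing_store=disk)
for key in (0, 1, 2, 0, 0):
    cache.get(key)
print(cache.hits)
```

In the run above, key 0 is pushed out of RAM by keys 1 and 2, comes back from the flash tier instead of disk, and is then served from RAM.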
  • 47. Long Term ★ Decades of hard drive assumptions about random IO cost need to be unwound. ✦ For example, InnoDB, Oracle, PostgreSQL work like this...
  • 48. Basic Operation (High Level) ★ Diagram components: Log Files, Tablespace, Buffer Pool. ★ Query: SELECT * FROM City WHERE CountryCode='AUS'
  • 54. Basic Operation (cont.) ★ Diagram components: Log Files, Tablespace, Buffer Pool. ★ Query: UPDATE City SET name = 'Morgansville' WHERE name = 'Brisbane' AND CountryCode='AUS'
  • 58. Basic Operation (cont.) ★ Diagram components: Log Files (labelled “01010”), Tablespace, Buffer Pool. ★ Query: UPDATE City SET name = 'Morgansville' WHERE name = 'Brisbane' AND CountryCode='AUS'
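The diagrams on the “Basic Operation” slides are not reproduced in this transcript. The sketch below is one reading of what they show, based on how write-ahead logging generally works: changes go to the buffer pool and an append-only log first, and the tablespace is updated later. The class `MiniEngine`, its methods, and the page-level granularity are all hypothetical simplifications, not InnoDB's actual implementation.

```python
class MiniEngine:
    """Toy write-ahead-logging engine: buffer pool, redo log, tablespace."""

    def __init__(self, tablespace):
        self.tablespace = dict(tablespace)  # page_id -> contents ("on disk")
        self.buffer_pool = {}               # pages cached in memory
        self.dirty = set()                  # pages changed since last flush
        self.log = []                       # append-only log (sequential IO)

    def read(self, page_id):
        if page_id not in self.buffer_pool:
            # Cache miss: a random read against the tablespace.
            self.buffer_pool[page_id] = self.tablespace[page_id]
        return self.buffer_pool[page_id]

    def update(self, page_id, new_value):
        self.read(page_id)                      # page must be in memory first
        self.log.append((page_id, new_value))   # log the change sequentially
        self.buffer_pool[page_id] = new_value   # modify only the cached copy
        self.dirty.add(page_id)

    def checkpoint(self):
        # Later, in the background: random writes back to the tablespace.
        for page_id in sorted(self.dirty):
            self.tablespace[page_id] = self.buffer_pool[page_id]
        self.dirty.clear()
        self.log.clear()                        # logged changes are now on disk


engine = MiniEngine({1: "Brisbane"})
engine.update(1, "Morgansville")
print(engine.tablespace[1], engine.log)   # tablespace still holds the old value
engine.checkpoint()
print(engine.tablespace[1], engine.log)
```

The design turns random writes into sequential log writes plus deferred background flushing, which is exactly the kind of hard-drive-era assumption the “Long Term” slide says needs revisiting on flash.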
  • 62. Long Term (cont.) ★ Examples of “the database is the log” for MySQL are the PBXT and RethinkDB storage engines.
  • 63. Storage Hardware also changes ★ Most of us are used to buying RAID controllers and placing disks below them. ✦ Only a very limited number of RAID controllers understand SSDs. ✦ RAID controllers are used to optimizing IO for devices capable of 100-200 IOPS. ✦ If we look at Fusion-io, the devices also use RAID internally (~RAID4).
  • 64. Technologies to look at ★ More PCI express cards. ✦ Potential to lower barrier to entry - only ~2-3 players, competition not as hot as it could be (yet). ★ More Enterprise focused MLC. ✦ Better software (firmware) means more wear levelling, improved performance, etc. ✦ More storage in fewer cells = lower cost. ★ Violin Memory ✦ I am not hands-on familiar with their technology, but they have some very high end offerings. ✦ Expect more awesome high end offerings (all vendors).
  • 65. Questions ★ Thank you to Confoo for letting me speak about such a niche topic! ★ If I’m out of time, please feel free to catch me around.