SSDs, IMDGs and All the Rest 
A short intro to how SSDs are powering the data revolution
Uri Cohen 
Head of Product @ GigaSpaces 
@uri1803 
#jaxlondon 2014
The Data Processing Hierarchy
But Data Amounts Just Keep Growing
But We Have a Performance Gap
In-Memory Computing to the Rescue?
Not enough anymore…
• Average GigaSpaces XAP cluster size grew 5-10 fold since 2008
• We’re in the realm of terabytes, not gigabytes
SSD to Save the Day!
https://www.mimoco.com
(It Actually Looks More Like This)
Some Numbers

Level | Random Access Time | Typical Size
Registers | instantaneous | under 1KB
Level 1 Cache | 1-3 ns | 64KB per core
Level 2 Cache | 3-10 ns | 256KB per core
Level 3 Cache | 10-20 ns | 2-20 MB per chip
Main Memory | 30-60 ns | 4-32 GB per system
SSD | < 1,000,000 ns | 128GB – 2TB
Hard Disk | 3,000,000-10,000,000 ns | over 1TB
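Turning the table into ratios makes the gap concrete. A back-of-the-envelope sketch (the SSD figure assumes ~100 µs, a typical 2014-era random read; the table above only bounds it below 1 ms):

```python
# Rough latency ratios from the table above (real devices vary widely)
dram_ns = 45            # midpoint of the 30-60 ns main-memory range
ssd_ns = 100_000        # ~100 us assumed; the table only says < 1,000,000 ns
hdd_ns = 6_500_000      # midpoint of the 3-10 ms hard-disk range

print(f"SSD vs DRAM: ~{ssd_ns / dram_ns:,.0f}x slower")   # ~2,222x
print(f"HDD vs SSD:  ~{hdd_ns / ssd_ns:,.0f}x slower")    # ~65x
print(f"HDD vs DRAM: ~{hdd_ns / dram_ns:,.0f}x slower")   # ~144,444x
```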
Performance Is All the Rage 
http://arstechnica.com/information-technology/2012/06/inside-the-ssd-revolution-how-solid-state-disks-really-work/
Is It All Roses and Daisies?
Step Back – How SSDs Work
The Foundation - NAND Chips
NAND Traits
Space-efficient (~60% less die area than NOR)
→ Effectively only NAND is used commercially
NAND Traits
Can only read and write whole pages, 4096 or 8192 bytes at a time
→ Modern file systems work this way anyway (but keep that in mind for later)
NAND Traits
Limited life span (5K-10K write/erase cycles per block)
→ Need to evenly distribute the write load across all blocks
NAND Traits
You cannot update a page “in place”
→ So why not delete it and write a new one instead?
Duh, you can only erase whole blocks
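A toy model makes the constraint concrete. This is a minimal sketch (the sizes and names are illustrative, not any vendor's firmware): pages can be programmed only when erased, and erasing works only a whole block at a time.

```python
PAGE_SIZE = 4096          # bytes per page
PAGES_PER_BLOCK = 512     # 512 x 4KB = 2MB erase block

class Block:
    """Toy NAND block: program pages individually, erase only as a whole."""
    def __init__(self):
        self.pages = [None] * PAGES_PER_BLOCK   # None == erased
        self.erase_count = 0

    def program(self, page_no, data):
        if self.pages[page_no] is not None:
            raise ValueError("page already programmed; erase the WHOLE block first")
        self.pages[page_no] = data

    def erase(self):
        self.pages = [None] * PAGES_PER_BLOCK
        self.erase_count += 1    # each erase burns one of the 5K-10K cycles

blk = Block()
blk.program(0, b"v1" + b"\0" * (PAGE_SIZE - 2))
# blk.program(0, ...)   # would raise: there is no in-place update
blk.erase()             # the only way to make page 0 writable again
blk.program(0, b"v2" + b"\0" * (PAGE_SIZE - 2))
```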
Typical Update Cycle
Typical Update Cycle
• Updating 4096 bytes (or fewer) of data can result in 2MB of data moving around on the SSD
• This is called Write Amplification
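The 2MB figure follows directly from the geometry sketched above: with 4KB pages and 512-page erase blocks, rewriting a single page in the worst case means relocating the entire block. A rough worked example (assuming those sizes; actual geometries vary by drive):

```python
PAGE_SIZE = 4096
PAGES_PER_BLOCK = 512
BLOCK_SIZE = PAGE_SIZE * PAGES_PER_BLOCK     # 2,097,152 bytes = 2MB

host_write = PAGE_SIZE                       # the app updates one 4KB page
flash_write = BLOCK_SIZE                     # worst case: whole block rewritten

write_amplification = flash_write / host_write
print(f"worst-case write amplification: {write_amplification:.0f}x")   # 512x
```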
Controllers to the Rescue
Write Caching
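The controller's first trick: buffer incoming writes in its DRAM until a full block's worth has accumulated, then program it sequentially. A minimal sketch of the coalescing idea (illustrative only, not any controller's actual firmware):

```python
class WriteCache:
    """Toy write cache: coalesce small host writes into full-block flushes."""
    def __init__(self, block_size=2 * 1024 * 1024):
        self.block_size = block_size
        self.buffer = bytearray()
        self.flushes = 0

    def write(self, data: bytes):
        self.buffer += data
        while len(self.buffer) >= self.block_size:
            self._flush(bytes(self.buffer[:self.block_size]))
            del self.buffer[:self.block_size]

    def _flush(self, block: bytes):
        self.flushes += 1   # one sequential block program instead of many page writes

cache = WriteCache()
for _ in range(1024):
    cache.write(b"x" * 4096)     # 1024 x 4KB = 4MB of small writes
print(cache.flushes)             # 2 full-block flushes
```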
Garbage Collection (Grrrrrr….)
Compacts fragmented disk blocks → but has a performance cost
• Modern SSDs try to do this in the background...
• When no empty blocks are available, GC must be done before ANY write can go through
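In outline, garbage collection behaves like a copying collector: pick a victim block (typically the one with the fewest still-valid pages), relocate its valid pages, then erase it. A greedy sketch under the toy model's assumptions:

```python
def garbage_collect(blocks, valid_pages):
    """Greedy GC sketch (illustrative only).

    blocks: iterable of block ids
    valid_pages: dict mapping block id -> set of still-valid page numbers
    """
    victim = min(blocks, key=lambda b: len(valid_pages[b]))
    copied = len(valid_pages[victim])   # valid pages must be rewritten elsewhere
    # 1. copy `copied` pages into a block with free pages (extra writes!)
    # 2. erase the victim block (burns an erase cycle)
    # 3. the victim is now fully empty and available for new writes
    return victim, copied

blocks = ["A", "B", "C"]
valid_pages = {"A": set(range(400)), "B": set(range(12)), "C": set(range(230))}
victim, copied = garbage_collect(blocks, valid_pages)
print(victim, copied)   # B 12 -> reclaiming B costs only 12 page copies
```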
Striping
Wear Leveling
A bag of techniques the controller uses to keep all of the flash cells at roughly the same level of use
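The simplest such technique is to allocate new writes from the least-worn free block (controllers also occasionally relocate long-lived "cold" data off barely-worn blocks, which itself adds write amplification, as the deck's speaker notes flag). A sketch of the allocation half, reusing the toy Block model's erase_count:

```python
class FreeBlock:
    def __init__(self, name, erase_count):
        self.name, self.erase_count = name, erase_count

def pick_block(free_blocks):
    """Dynamic wear-leveling sketch: allocate from the least-erased free block."""
    return min(free_blocks, key=lambda b: b.erase_count)

free = [FreeBlock("A", 4200), FreeBlock("B", 980), FreeBlock("C", 3100)]
print(pick_block(free).name)   # B - steer new writes toward the least-worn cells
```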
Dedupe & Compression
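The deck's speaker notes call out SandForce-style controllers here: compress incoming data, check it for duplicates, and discard pages the drive already stores, so fewer bytes actually reach the flash. A minimal content-hash sketch of the dedupe half (illustrative; real controllers do this in hardware):

```python
import hashlib

class DedupStore:
    """Toy inline dedupe: identical pages are stored physically only once."""
    def __init__(self):
        self.by_hash = {}          # content hash -> physical location
        self.physical_writes = 0

    def write_page(self, data: bytes):
        key = hashlib.sha256(data).digest()
        if key not in self.by_hash:
            self.by_hash[key] = f"phys-{self.physical_writes}"
            self.physical_writes += 1    # only new content reaches the flash
        return self.by_hash[key]

store = DedupStore()
for _ in range(100):
    store.write_page(b"\x00" * 4096)     # 100 identical zero pages...
print(store.physical_writes)             # ...cost a single physical write
```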
Databases, Charge Ahead!
http://cdn.pcworld.idg.com.au/article/images/740x500/dimg/larry-mario_500.jpg
The Naive - MySQL (or PostgreSQL, Oracle, Mongo, …)
Let’s just use it! (and write data in place FTW)
The Naive - MySQL (or PostgreSQL, Oracle, Mongo, …)
• They all buffer writes before flushing to disk
• ... but flushes are still RANDOM writes
Source: Anandtech
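Why random flushes hurt: each dirty page lands at a random logical offset, so one batch of flushes scatters across many erase blocks, and every touched block is a candidate for the read-modify-write cycle above. A quick sketch of the spread (page and block sizes assumed as before):

```python
import random

PAGES_PER_BLOCK = 512

def blocks_touched(page_numbers):
    return len({p // PAGES_PER_BLOCK for p in page_numbers})

total_pages = 1_000_000                       # ~4GB of 4KB pages
dirty = 1000                                  # pages flushed in one checkpoint

random_flush = random.sample(range(total_pages), dirty)
sequential_flush = range(dirty)

print(blocks_touched(random_flush))           # hundreds of erase blocks (~780 on average)
print(blocks_touched(sequential_flush))       # 2 blocks
```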
Cassandra Already Optimized (But for what?)
Cassandra Write Path 
http://www.slideshare.net/rbranson/cassandra-and-solid-state-drives
C* Observations (for SSDs)
• All disk writes are sequential and append-only
• Compaction is applied when merging SSTables
• SSTables are immutable once written
→ No write amplification
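The shape of Cassandra's write path is easy to caricature: append the mutation to a commit log, apply it to an in-memory memtable, and when the memtable fills, flush it sequentially to disk as an immutable SSTable. A schematic sketch (not Cassandra's actual code, just the LSM pattern it follows):

```python
class LsmWriter:
    """Schematic LSM write path: log append + memtable + immutable SSTable flush."""
    def __init__(self, memtable_limit=4):
        self.commit_log = []        # append-only: sequential, SSD-friendly
        self.memtable = {}          # in-memory buffer (a dict for brevity)
        self.sstables = []          # immutable once written, never updated in place
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. sequential append for durability
        self.memtable[key] = value             # 2. in-memory update
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # 3. one big sequential write; the resulting SSTable is immutable
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable.clear()

db = LsmWriter()
for i in range(8):
    db.write(f"k{i}", i)
print(len(db.sstables))   # 2 flushed SSTables, zero in-place disk updates
```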
But Still…
• Read path is complex
• Compaction can cause performance variations
Why Do We Treat SSDs the Same as HDDs?
Software Optimizations
Direct access (sketched below):
• No kernel-space overhead
• TRIM
• Multithreading
• Caching in DRAM
• On-disk and DRAM indexing
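To ground the first bullet: on Linux, O_DIRECT bypasses the kernel page cache so reads and writes go straight to the device, at the cost of strict alignment rules. A minimal sketch (Linux-only; the 4KB alignment assumption is device-dependent, and TRIM itself is usually left to fstrim or the filesystem's discard option):

```python
import mmap, os

PAGE = 4096

# O_DIRECT skips the kernel page cache: no extra copy, no kernel-side caching.
# In exchange, buffers and offsets must be aligned to the device block size.
fd = os.open("/tmp/direct.bin", os.O_RDWR | os.O_CREAT | os.O_DIRECT, 0o644)

buf = mmap.mmap(-1, PAGE)      # anonymous mmap memory is page-aligned, as required
buf.write(b"A" * PAGE)
os.pwrite(fd, buf, 0)          # one aligned 4KB write, straight to the device

out = mmap.mmap(-1, PAGE)
os.preadv(fd, [out], 0)        # read back into an aligned buffer
assert bytes(out) == b"A" * PAGE
os.close(fd)
```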
Flash-Optimized APIs
How We Did It
Raw Performance Numbers
• RAM only: ~1M read txns/sec
• RAM + SSD: 242K read txns/sec
Looking at It from a Cost Perspective
Provides 2x – 3.6x better TPS/$ while reducing servers by 50% (the arithmetic is sketched below)
Test setup:
• 1KB object size and uniform distribution
• 2-socket 2.8GHz CPU with 24 cores total, CentOS 5.8, 2 FusionIO SLC PCIe cards in RAID
• YCSB measurements performed by SanDisk
Assumptions: 1TB Flash = $2K; 1TB RAM = $20K
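A sanity check on the headline ratio, using only the numbers above (a rough sketch; it ignores server and power costs and the DRAM still needed alongside the flash): flash is 10x cheaper per TB while throughput drops about 4x, which nets out to roughly 2.4x better TPS/$, inside the claimed 2x – 3.6x range.

```python
ram_tps, ssd_tps = 1_000_000, 242_000        # read txns/sec from the previous slide
ram_cost_tb, flash_cost_tb = 20_000, 2_000   # $/TB assumptions from this slide

tps_per_dollar_ram = ram_tps / ram_cost_tb       # 50.0 txns/sec per $ of storage
tps_per_dollar_ssd = ssd_tps / flash_cost_tb     # 121.0 txns/sec per $ of storage

print(f"RAM+SSD is ~{tps_per_dollar_ssd / tps_per_dollar_ram:.1f}x better TPS/$")  # ~2.4x
```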
Resources
• http://arstechnica.com/information-technology/2012/06/inside-the-ssd-revolution-how-solid-state-disks-really-work/
• http://www.slideshare.net/rbranson/cassandra-and-solid-state-drives
• http://www.sandisk.com/enterprise/zetascale/
• http://www.gigaspaces.com/xap-memoryxtend-flash-performance-big-data
Thank You!
