Ceph on All-Flash Storage –
Breaking Performance Barriers
Haf Saba, Director, Sales Engineering, APJ
Forward-Looking Statements
During our meeting today we will make forward-looking statements.
Any statement that refers to expectations, projections or other characterizations of future events or
circumstances is a forward-looking statement, including those relating to market growth, industry
trends, future products, product performance and product capabilities. This presentation also
contains forward-looking statements attributed to third parties, which reflect their projections as of the
date of issuance.
Actual results may differ materially from those expressed in these forward-looking statements due
to a number of risks and uncertainties, including the factors detailed under the caption “Risk Factors”
and elsewhere in the documents we file from time to time with the SEC, including our annual and
quarterly reports.
We undertake no obligation to update these forward-looking statements, which speak only as
of the date hereof or as of the date of issuance by a third party, as the case may be.
We make NAND flash (not just USB sticks)
SanDisk Technology Transitions, 2001–2017 (images not to scale):
1Gb X2 160nm → 2Gb 130nm → 4Gb 90nm → 8Gb 70nm → 16Gb 56nm → 32Gb 43nm → 64Gb 32nm → 64Gb 24nm → 128Gb 19nm → 128Gb 1Ynm → 128Gb 15nm → 3D BiCS
Software-defined All-Flash Storage

Old Model
• Monolithic, large upfront investments, and fork-lift upgrades
• Proprietary storage OS
• Costly: $$$$$

New SD-AFS Model
• Disaggregate storage, compute, and software for better scaling and costs
• Best-in-class solution components
• Open source software – no vendor lock-in
• Cost-efficient: $
Software Defined Storage – what's NOT new

Proceed with caution
• Storage performance is hugely affected by seemingly small details
• All HW is not equal – switches, NICs, HBAs, SSDs all matter
  • Driver abstraction doesn't hide dynamic behavior
• All SW is not equal – distro, patches, drivers, configuration all matter
• There is typically a large delta between "default" and "tuned" system performance
• What's a user to do?
The Future is here…
The InfiniFlash™ System
 64-512 TB
JBOD of flash in
3U
 Up to 2M IOPS,
<1ms latency,
Up to 15 GB/s
throughput
 Energy Efficient
~400W power draw
 Connect
up to 8 servers
 Simple yet
Scalable
Infiniflash 550
64 hot swap
8TB cards
Thermal monitoring
And alerts
InfiniFlash™
8TB Flash-Card Innovation
• Enterprise-grade, power-fail safe
• Latching integrated & monitored
• Directly samples air temperature
• New flash form factor, not SSD-based

Non-disruptive Scale-Up & Scale-Out
• Capacity on demand
  • Serve high-growth Big Data
  • 3U chassis starting at 64TB, up to 512TB
  • 8 to 64 8TB flash cards (SAS)
• Compute on demand
  • Serve dynamic apps without IOPS/TB bottlenecks
  • Add up to 8 servers

Enterprise-grade SanDisk flash with power-fail protection
InfiniFlash™ (continued)
RAS (Reliability, Availability & Serviceability)
• Resilient – MTBF 1.5+ million hours
• Hot-swappable architecture – easy FRU of fans, SAS expander boards, power supplies, flash cards
• Low power – 400–500 W under typical workload; 150 W (idle) to 750 W (absolute max)
• 4 hot-swap fans and 2 power supplies
Designed for Big Data Workloads @ PB Scale
CONTENT REPOSITORIES BIG DATA ANALYTICS MEDIA SERVICES
A Closer Look
InfiniFlash IF550 (HW + SW)
• Ultra-dense, high-capacity flash storage
• Highly scalable performance
• Cinder, Glance and Swift storage
• Enterprise-class storage features
• Ceph optimized for SanDisk flash
InfiniFlash SW + HW Advantage
Software Storage System
• Software tuned for hardware – extensive Ceph mods
• Hardware configured for software – density, power, architecture
• Ceph has over 50 tuning parameters that together yield a 5x–6x performance improvement (a sketch of the kinds of options involved follows below)
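To make the "tuning parameters" point concrete, here is a minimal sketch – not the IF550's shipped settings. The option names are well-known Hammer/Jewel-era Ceph options, but the values are placeholders, and the real distro tunes far more than this handful:

```python
# Illustrative only: a few of the many OSD/FileStore options commonly tuned for
# flash in the Hammer/Jewel era. Values are placeholders, not IF550 defaults.
flash_tuning = {
    "osd_op_num_shards": 8,                 # more parallelism in the OSD op queue
    "osd_op_num_threads_per_shard": 2,
    "filestore_max_sync_interval": 10,      # batch filesystem syncs
    "filestore_queue_max_ops": 5000,        # raise throttles that were sized for HDD
    "journal_max_write_entries": 1000,
    "ms_tcp_nodelay": "true",               # avoid Nagle-induced latency (see read-path slide)
}

def render_conf(section: str, options: dict) -> str:
    """Render the options as a ceph.conf fragment for the given section."""
    lines = [f"[{section}]"]
    lines += [f"{key} = {value}" for key, value in options.items()]
    return "\n".join(lines)

if __name__ == "__main__":
    print(render_conf("osd", flash_tuning))
```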
IF550 - Enhancing Ceph for Enterprise Consumption
• The SanDisk Ceph distro provides stable, production-ready packaging with consistent quality
• All Ceph performance improvements developed by SanDisk are contributed back to the community
SanDisk / Red Hat or Community Distribution
• Out-of-the-box configurations tuned for performance with flash
• Sizing & planning tool
• InfiniFlash drive management integrated into Ceph management
• Ceph installer built for InfiniFlash
• High-performance iSCSI storage
• Log collection tool
• Enterprise-hardened SW + HW QA
How did we get here?
Ceph and SanDisk
• Started working with Ceph over 2.5 years ago (Dumpling)
• Aligned on a vision of scale-out enterprise storage
  • Multi-protocol design
  • Cluster / cloud oriented
  • Open source commitment
• SanDisk's engagement with Ceph
  • Flash levels of performance
  • Enterprise quality
  • Support tools for our product offerings
Optimising Ceph for the all-flash future
• Ceph was optimized for HDD – both tuning AND algorithm changes are needed for flash
• Quickly determined that the OSD was the major bottleneck
  • An OSD maxed out at about 1,000 IOPS on the fastest CPUs (using ~4.5 cores)
• Examined and rejected running multiple OSDs per SSD
  • Failure domains / CRUSH rules would be a nightmare
SanDisk: OSD Read Path Optimisation
• Context switches matter at flash rates
  • Too much "put it in a queue for another thread" and lock contention
• Socket handling matters too!
  • Too many "get 1 byte" calls to the kernel for sockets
  • Disable Nagle's algorithm to shorten operation latency (see the sketch below)
• Lots of other simple things
  • Eliminate repeated look-ups, redundant string copies, etc.
• Contributed improvements to the Emperor, Firefly and Giant releases
• Now obtain >200K IOPS per OSD, using around 12 CPU cores per OSD (Jewel)
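The Nagle point generalizes beyond Ceph (Ceph exposes the same switch as the ms_tcp_nodelay option). A minimal Python sketch, with a placeholder endpoint, of disabling Nagle's algorithm on a latency-sensitive client socket:

```python
import socket

# Nagle's algorithm coalesces small writes, trading latency for fewer packets.
# For small, latency-sensitive request/response messages (the OSD case above),
# disable it with TCP_NODELAY so each message hits the wire immediately.
def connect_low_latency(host: str, port: int) -> socket.socket:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # disable Nagle
    sock.connect((host, port))
    return sock

if __name__ == "__main__":
    try:
        s = connect_low_latency("127.0.0.1", 6800)  # placeholder endpoint
        s.sendall(b"ping")
        s.close()
    except OSError as exc:
        print(f"demo connection failed (expected without a listener): {exc}")
```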
SanDisk: OSD Write Path Optimization
• The write path strategy was classic HDD
  • Journal writes for minimum foreground latency
  • Process the journal in batches in the background
• Inefficient for flash
• Modified the buffering/writing strategy for flash (Jewel)
  • 2.5x write throughput, and average latency is half that of Hammer
InfiniFlash Ceph Performance
- Measured on a 512TB Cluster
Test Configuration
• 2x InfiniFlash systems, 256TB each
• 8x OSD nodes
  • 2x E5-2697 v3 (14 cores, 2.6 GHz), 8x 16GB DDR4 ECC 2133MHz, 1x Mellanox X3 dual 40GbE
  • Ubuntu 14.04.02 LTS 64-bit
• 8–10 client nodes
• Ceph version: sndk-ifos-1.3.0.317, based on Ceph 10.2.1 (Jewel)
IFOS Block IOPS Performance
Highlights
• 4K numbers are CPU bound; an increase in server CPU will improve IOPS by ~11%
• At 64K and larger block sizes, bandwidth is close to raw box bandwidth
• 256K random read numbers can be increased further with more clients; >90% drive saturation was achieved with 14 clients
Random Read (sum of IOPS / bandwidth in GB/s)
• 4K: 1,521,231 IOPS, ~6 GB/s
• 64K: 347,628 IOPS, ~22 GB/s
• 256K: 82,456 IOPS, ~21 GB/s

Random Write (sum of IOPS / bandwidth in GB/s)
• 4K: 201,465 IOPS, ~0.8 GB/s
• 64K: 55,648 IOPS, ~3.5 GB/s
• 256K: 16,289 IOPS, ~4.1 GB/s

Write performance is on a 2x copy configuration.
IFOS Block Workload Latency Performance
Environment
• librbd I/O read latency measured on the Golden Config with 2-way replication at host level, 8 OSD nodes, 20-minute I/O duration
• fio read I/O profile: 64K block, numjobs=2, iodepth=16, 10 clients (each client with one RBD) – approximated in the sketch below
• fio write I/O profile: 64K block, numjobs=2, iodepth=16, 10 clients (each client with one RBD)
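For reference, a rough approximation of how that profile maps onto an fio command line using the librbd engine – the pool, image and client names below are placeholders, not the exact job file used in the test:

```python
import subprocess

# Approximate reconstruction of the 64K random-read profile described above:
# librbd engine, bs=64k, numjobs=2, iodepth=16, 20-minute run, one RBD per client.
# Pool/image/client names are placeholders.
fio_cmd = [
    "fio",
    "--name=rbd-64k-randread",
    "--ioengine=rbd",
    "--clientname=admin",
    "--pool=rbd",
    "--rbdname=testimage",
    "--rw=randread",          # use --rw=randwrite for the write profile
    "--bs=64k",
    "--numjobs=2",
    "--iodepth=16",
    "--time_based",
    "--runtime=1200",         # 20-minute run, as in the test description
    "--group_reporting",
]

if __name__ == "__main__":
    subprocess.run(fio_cmd, check=True)
```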
64K Random Read (average latency: 1.7 ms)
Latency range (µs) / percentile share:
• 500: 2.21
• 750: 0.22
• 1,000: 7.43
• 2,000: 62.72
• 4,000: 26.11
• 10,000: 1.27
• 20,000: 0.03
• 50,000: 0.01
99% of read IOPS complete within 5 ms latency. 99% IOPS: 178,367.31

64K Random Write (average latency: 6.3 ms)
Latency range (µs) / percentile share:
• 1,000: 0.33
• 2,000: 43.16
• 4,000: 28.11
• 10,000: 21
• 20,000: 3.31
• 50,000: 2.17
• 100,000: 1.35
• 250,000: 0.4
• 500,000: 0.11
• 750,000: 0.04
• 1,000,000: 0.01
• 2,000,000: 0.01
99% IOPS: 227,936.61
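As a sanity check, the read average can be re-derived from the bucket shares above; the sketch below uses bucket midpoints, so it is only an approximation of the reported 1.7 ms:

```python
# Approximate the mean 64K random-read latency from the histogram buckets above.
# Buckets are interpreted as (lower, upper) ranges in µs; midpoints are a rough assumption.
read_buckets = [  # (lower µs, upper µs, percentile share)
    (0, 500, 2.21), (500, 750, 0.22), (750, 1000, 7.43), (1000, 2000, 62.72),
    (2000, 4000, 26.11), (4000, 10000, 1.27), (10000, 20000, 0.03), (20000, 50000, 0.01),
]

mean_us = sum((lo + hi) / 2 * pct for lo, hi, pct in read_buckets) / 100
print(f"approx mean read latency: {mean_us / 1000:.1f} ms")  # ~1.9 ms, close to the reported 1.7 ms
```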
IFOS Object Performance
• Erasure coding provides the protection equivalent of 3x replica storage with only ~1.2x storage (see the overhead arithmetic below)
• Object performance is on par with block performance
• Larger node clusters = wider EC ratio = more storage savings

• Replication configuration
  • OSD-level replication with 2 copies
• Erasure coding configuration
  • Node-level erasure coding with Cauchy-Good 6+2
  • Cauchy-Good is better suited to InfiniFlash than Reed-Solomon
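To make the storage-efficiency claim concrete: raw overhead for a k+m erasure-coded pool is (k+m)/k, versus N for N-way replication. The 10+2 line below is purely illustrative of a wider stripe on a larger cluster; it is not a tested configuration:

```python
def raw_per_usable(k: int, m: int) -> float:
    """Raw bytes stored per byte of usable data for a k+m erasure-coded pool."""
    return (k + m) / k

print("3x replication: 3.00x raw, tolerates 2 lost copies")
print(f"EC 6+2:  {raw_per_usable(6, 2):.2f}x raw, tolerates 2 lost chunks")   # 1.33x
print(f"EC 10+2: {raw_per_usable(10, 2):.2f}x raw, tolerates 2 lost chunks")  # 1.20x, needs more failure domains
```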
[Chart: 4M object throughput (GB/s) by protection scheme – Repl (2x) read, EC (6+2) read, Repl (2x) write, EC (6+2) write – Erasure Coding vs. Replication]
IF550 Reference Configurations

Small Block I/O
• Small: 2x IF150, 128TB to 256TB flash per enclosure using the performance card (4TB); 1 OSD server per 4–8 cards (dual E5-2680, 64GB RAM)
• Medium: 2x IF150, 128TB to 256TB flash per enclosure using the performance card (4TB); 1 OSD server per 4–8 cards (dual E5-2687, 128GB RAM)
• Large: 2+ IF150, 128TB to 256TB flash per enclosure using the performance card (4TB); 1 OSD server per 4–8 cards (dual E5-2697+, 128GB RAM)

Throughput
• Small: 2x IF150, 128TB to 512TB flash per enclosure; 1 OSD server per 16 cards (dual E5-2660, 64GB RAM)
• Medium: 2x IF150, 128TB to 512TB flash per enclosure; 1 OSD server per 16 cards (dual E5-2680, 128GB RAM)
• Large: 2+ IF150, 128TB to 512TB flash per enclosure; 1 OSD server per 16 cards (dual E5-2680, 128GB RAM)

Mixed
• Small: 2x IF150, 128TB to 512TB flash per enclosure; 1 OSD server per 8–16 cards (dual E5-2680, 64GB RAM)
• Medium: 2x IF150, 128TB to 512TB flash per enclosure; 1 OSD server per 8–16 cards (dual E5-2690+, 128GB RAM)
• Large: 2+ IF150, 128TB to 256TB flash per enclosure (optional performance card); 1 OSD server per 8–16 cards (dual E5-2695+, 128GB RAM)
InfiniFlash TCO Advantage
• Reduce the replica count from 3 to 2
• Less compute, less HW and SW
• TCO analysis based on a US customer's OPEX and cost data for a 5PB deployment
[Charts: 5-year TCO (CAPEX + OPEX, $ millions), data center racks, and total energy cost ($ thousands) for InfiniFlash vs. External AFA vs. DAS SSD node vs. DAS 10K HDD node]
Flash is on par or cheaper than buying HDDs
What’s on the roadmap?
SanDisk: Potential Future Improvements
• RDMA intra-cluster communication
  • Significant reduction in CPU / IOP
• BlueStore
  • Significant reduction in write amplification -> even higher write performance
• Memory allocation
  • tcmalloc/jemalloc/AsyncMessenger tuning shows up to 3x IOPS vs. default*
• Erasure coding for blocks (native)

* https://drive.google.com/file/d/0B2gTBZrkrnpZY3U3TUU3RkJVeVk/view
Time to Fix the Write Path Algorithm
• Review of FileStore – what's wrong with it:
  • XFS + LevelDB
  • Missing transactional semantics for metadata and data
  • Missing virtual-copy and merge semantics
  • The BTRFS implementation of these isn't general enough
• Snapshot/rollback overhead is too expensive for frequent use
• Transaction semantics aren't crash-proof
• Bad write amplification (quantified in the sketch below)
• Bad jitter due to an unpredictable file system
• Bad CPU utilization – syncfs is VERY expensive
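The write-amplification point is easy to quantify: under FileStore every client write hits the device at least twice (a full-data journal write plus the apply to the XFS file), before any filesystem metadata is counted. A rough sketch of that arithmetic, using 64K as an arbitrary example:

```python
# Rough device-write accounting for a single client write under FileStore:
# the payload is written to the journal, then applied to the backing file.
# Filesystem metadata/journaling and syncfs costs come on top of this.
def filestore_device_bytes(client_bytes: int) -> int:
    journal_write = client_bytes      # full-data journal
    data_write = client_bytes         # apply to the XFS file
    return journal_write + data_write

client_bytes = 64 * 1024
wa = filestore_device_bytes(client_bytes) / client_bytes
print(f"write amplification >= {wa:.1f}x before filesystem overhead")  # >= 2.0x
```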
BlueStore
• One, two or three raw block devices
  • Data, metadata/WAL, and KV journaling
  • When combined, no fixed partitioning is needed
• Use a single transactional KV store for all metadata
  • Semantics are well matched to ObjectStore transactions (a toy sketch follows the diagram below)
• Use a raw block device for data storage
  • Supports flash, PMR and SMR HDD
[Diagram: BlueStore architecture – ObjectStore / BlueStore over a KeyValueDB (metadata, via BlueFS) and a raw data device, with an operation decoder, journal, client/gateway operations, peer-to-peer cluster management, and network interface]
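A toy sketch (not Ceph code) of the ordering BlueStore relies on: put the data on the raw device first, then commit the metadata that references it in a single KV transaction, so a crash before the commit leaves only unreferenced space rather than a half-updated object. Here sqlite3 stands in for RocksDB/ZetaScale, a plain file stands in for the raw block device, and all names are made up:

```python
import os
import sqlite3

# Toy illustration of the BlueStore-style ordering: data first, then one atomic
# metadata commit that makes the data visible. sqlite3 is only a stand-in for
# the real KV store; a plain file is a stand-in for the raw block device.
DEVICE = "toy_device.img"
meta = sqlite3.connect("toy_meta.db")
meta.execute("CREATE TABLE IF NOT EXISTS extents (obj TEXT PRIMARY KEY, off INTEGER, len INTEGER)")

def write_object(name: str, payload: bytes) -> None:
    with open(DEVICE, "ab") as dev:
        dev.seek(0, os.SEEK_END)          # "allocate" space at the end of the device
        offset = dev.tell()
        dev.write(payload)
        dev.flush()
        os.fsync(dev.fileno())            # data durable *before* the metadata commit
    with meta:                            # single atomic transaction: publish the reference
        meta.execute("INSERT OR REPLACE INTO extents VALUES (?, ?, ?)",
                     (name, offset, len(payload)))

def read_object(name: str) -> bytes:
    offset, length = meta.execute(
        "SELECT off, len FROM extents WHERE obj = ?", (name,)).fetchone()
    with open(DEVICE, "rb") as dev:
        dev.seek(offset)
        return dev.read(length)

if __name__ == "__main__":
    write_object("rbd_data.0001", b"hello flash")
    print(read_object("rbd_data.0001"))
```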
BlueStore vs FileStore
1x 800GB P3700 card (4 OSDs per card), 64GB RAM, 2x Intel Xeon E5-2650 v3 @ 2.30GHz, 1x Intel 40GbE link.
Client fio processes and the mon ran on the same nodes as the OSDs.
KV Store Options
• RocksDB is a Facebook extension of LevelDB
  • Log-Structured Merge (LSM) based
  • Ideal when metadata is on HDD
  • Merge is effectively host-based GC when run on flash
• ZetaScale™ from SanDisk®, now open sourced
  • B-tree based
  • Ideal when metadata is on flash
  • Uses device-based GC for maximum performance
BlueStore ZetaScale vs RocksDB Performance
Test setup: 1 OSD, 8TB SAS SSD, 10GB RAM, Intel Xeon E5-2680 v2 @ 2.80GHz, fio, 32 threads, 64 iodepth, 6TB dataset, 30 min

Random read/write 4K IOPS per OSD (in thousands), by read/write ratio:
• 0/100 (all writes): BlueStore (RocksDB) 0.436, BlueStore (ZetaScale) 0.95
• 70/30: BlueStore (RocksDB) 1.005, BlueStore (ZetaScale) 2.83
• 100/0 (all reads): BlueStore (RocksDB) 3.970, BlueStore (ZetaScale) 9.29
The InfiniFlash™ System ...
• Power: 70% less
• Speed: 40x faster than SAN
• Density: 10x higher
• Reliability: 20x better AFR
• Cost: up to 80% lower TCO
Thank You! @BigDataFlash
#bigdataflash
©2015 SanDisk Corporation. All rights reserved. SanDisk is a trademark of SanDisk Corporation, registered in the United States and other countries. InfiniFlash is a trademark of SanDisk Enterprise IP LLC. All other product and company names are used for identification purposes and may be trademarks of their respective holder(s).

Editor's Notes

  • #16 Video continues to drive the need for storage, and point-of-view cameras like GoPro are producing compelling high-resolution videos on our performance cards. People using smartphones to make high-resolution videos also choose our performance mobile cards, driving the need for higher capacities. There is a growing customer base for us around the world, with one billion additional people joining the global middle class between 2013 and 2020. These people will use smart mobile devices as their first choice for discretionary spending, and will expand their storage using removable cards and USB drives. We are not standing still, but creating new product categories to allow people to expand and share their most cherished memories.
  • #19 E.g., if you're using it for small blocks, you need more CPUs; large objects, however, can use fewer servers. Your choice on how you want to deploy it.
  • #24 All of these are listed as various inefficiencies. Originally about 10K IOPS before doing all the optimisations.
  • #28 Ran a 1PB test on Hammer with 256TB scaling. Almost linear scaling. You can't get these numbers easily elsewhere.
  • #30 Typically latency is around 10ms for reads and 20–40ms for writes, even on flash!
  • #32 EC is a customer-configurable option. A lot more writes with 3 copies. SanDisk is working on block EC; right now it is object only.
  • #34 The point here is that flash is about the same cost as HDD.
  • #40 FileStore is the existing backend storage for Ceph, with many deficits. BlueStore is the new architecture. This is a preview – tech preview for the rest of this year; by the L release it will be production. We will switch to a KV store and get rid of the journal (a journal incurs too many writes – almost 4x). The KV store will be RocksDB, but SanDisk will introduce a flash-optimised KV store based on ZetaScale.
  • #44 Today's flash solutions and arrays can address most of these problems – they are low power, high performance, somewhat scalable (though not to tens of PBs) and highly reliable – but one thing holds them back: favorable economics. Flash is simply way too expensive for at-scale workloads, putting it out of reach. So we went to work as a team; our first investment, and the best we ever made, was a very clean sheet of paper. We knew that today's HDD-based storage solutions and today's all-flash arrays would not do the trick; we had to create something brand new that looks like nothing the world has seen. Substantiation: Low power – 20 enclosures down to 2, at 100 watts per enclosure and 24 drives per enclosure (9W HDD, 7W SSD) = 93% power reduction, or 1/16 the power (from the TCO calculator: HDD – 480 drives at 9W plus 20 enclosures at 100W = 6,320W; SSD – 46 SSDs at 7W for 176TB plus 2 enclosures at 100W = 536W). Extreme performance – 30x faster NoSQL transactions (MongoDB solution brief). Scalable – 4,500 virtual desktops in one rack (Fusion ioVDI and VMware Horizon View VDI reference architecture). Reliable – accelerate Oracle backup using SanDisk solid state drives (SSDs). Breakthrough economics – ~3x faster Hadoop jobs with half the servers (Increasing Hadoop Performance with SanDisk SSDs).
  • #44 Today's flash solutions and arrays can address most of these problems – they are low power, high performance, somewhat scalable (though not to 10s of Pbs) and highly reliable but there is one thing that is holds it back – something missing – Favorable Economics – it’s simply way too expensive for @ scale workloads making flash out of reach. So we went to work as a team – our first investment was the best investment we ever made – a very clean sheet of paper !!! We knew that today’s HDD based storage solutions and today’s all flash arrays would not do the trick. We had to create something brand new that looks like nothing the world has seen. Substantiation Low Power 20 enclosures down to 2 – 100 watts per enclosure 24 Drives per enclosure (9w HDD, 7W SSD) = 93% power reductions or 1/16 the power. From his TCO Calculator. HDD 480 Drives -9w 20 Enclosures – 100w 4320 + 2000 = 6320w SSD 46 SSDs – 7w 176TB 2 Enclosures – 100w 536 Watts Extreme Performance 30x faster NoSQL transactions MongoDB Solution Brief   Scalable 4,500 virtual desktops in one rack Fusion ioVDI and VMware Horizon View: Reference Architecture for VDI   Reliable Accelerate Oracle Backup Using SanDisk Solid State Drives (SSDs) Accelerate Oracle Backup Using SanDisk Solid State Drives (SSDs)   Breakthrough Economics ~3x faster Hadoop jobs with half the servers Increasing Hadoop Performance with SanDisk SSDs