This document summarizes a SanDisk presentation on Ceph storage performance on all-flash systems. It describes how SanDisk optimized Ceph for all-flash environments by tuning the OSD to keep up with flash drives, reaching over 200,000 IOPS per OSD using about 12 CPU cores. Testing on SanDisk's InfiniFlash storage system showed over 1.5 million 4K random read IOPS and about 200,000 4K random write IOPS, with low latency: 99% of 64K random read operations completed within 5ms. The document also outlines IF550 reference configurations for small-block, throughput, and mixed workloads at small, medium and large scale.
2. Forward-Looking Statements
During our meeting today we will make forward-looking statements. Any statement that refers to expectations, projections or other characterizations of future events or circumstances is a forward-looking statement, including those relating to market growth, industry trends, future products, product performance and product capabilities. This presentation also contains forward-looking statements attributed to third parties, which reflect their projections as of the date of issuance.
Actual results may differ materially from those expressed in these forward-looking statements due to a number of risks and uncertainties, including the factors detailed under the caption "Risk Factors" and elsewhere in the documents we file from time to time with the SEC, including our annual and quarterly reports.
We undertake no obligation to update these forward-looking statements, which speak only as of the date hereof or as of the date of issuance by a third party, as the case may be.
5. Software-defined All-Flash Storage
Old model:
• Monolithic, with large upfront investments and fork-lift upgrades
• Proprietary storage OS
• Costly: $$$$$
New SD-AFS model:
• Disaggregate storage, compute, and software for better scaling and costs
• Best-in-class solution components
• Open source software – no vendor lock-in
• Cost-efficient: $
7. Software Defined Storage – what's NOT new
Storage performance is hugely affected by seemingly small details.
• All HW is not equal – switches, NICs, HBAs and SSDs all matter
• Driver abstraction doesn't hide dynamic behavior
• All SW is not equal – distro, patches, drivers and configuration all matter
• There is typically a large delta between "default" and "tuned" system performance
What's a user to do?
9. The InfiniFlash™ System
• 64–512 TB JBOD of flash in 3U
• Up to 2M IOPS, <1ms latency, up to 15 GB/s throughput
• Energy efficient: ~400W power draw
• Connects up to 8 servers
• Simple yet scalable
12. InfiniFlash™
8TB Flash-Card Innovation
• Enterprise-grade, power-fail safe
• Latching integrated & monitored
• Directly samples air temperature
• New flash form factor, not SSD-based
Non-disruptive Scale-Up & Scale-Out
• Capacity on demand
  o Serves high-growth Big Data
  o 3U chassis starting at 64TB, up to 512TB
  o 8 to 64 8TB flash cards (SAS)
• Compute on demand
  o Serves dynamic apps without IOPS/TB bottlenecks
  o Add up to 8 servers
17. InfiniFlash IF550 (HW + SW)
• Ultra-dense, high-capacity flash storage
• Highly scalable performance
• Cinder, Glance and Swift storage
• Enterprise-class storage features
• Ceph optimized for SanDisk flash
18. InfiniFlash SW + HW Advantage
Software storage system:
• Software tuned for hardware – extensive Ceph modifications
• Hardware configured for software – density, power, architecture
Ceph has over 50 tuning parameters that together yield a 5x–6x performance improvement.
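The deck does not enumerate the 50+ parameters, but a sketch of the kind of Jewel-era ceph.conf tuning typically applied on flash might look like the fragment below. The values are illustrative assumptions, not SanDisk's actual settings:

```ini
[global]
; Skip in-messenger data checksums when the transport is trusted
ms_crc_data = false

[osd]
; More sharded op queues so many cores can feed one fast device
osd_op_num_shards = 8
osd_op_num_threads_per_shard = 2
; Deeper FileStore/journal queues to keep an SSD busy
filestore_queue_max_ops = 5000
filestore_max_sync_interval = 10
journal_max_write_entries = 1000
journal_max_write_bytes = 1048576000
```

Any such tuning has to be validated against the actual hardware; the point of the slide is that the defaults are sized for disks, not flash.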
19. IF550 – Enhancing Ceph for Enterprise Consumption
• The SanDisk Ceph distro ensures packaging with stable, production-ready code of consistent quality
• All Ceph performance improvements developed by SanDisk are contributed back to the community
Built on the SanDisk/Red Hat or community distribution:
• Out-of-the-box configurations tuned for performance with flash
• Sizing & planning tool
• InfiniFlash drive management integrated into Ceph management
• Ceph installer built for InfiniFlash
• High-performance iSCSI storage
• Log collection tool
• Enterprise-hardened SW + HW QA
21. Ceph and SanDisk
Started working with Ceph over 2.5 years ago (Dumpling), aligned on a vision of scale-out enterprise storage:
• Multi-protocol design
• Cluster / cloud oriented
• Open source commitment
SanDisk's engagement with Ceph:
• Flash levels of performance
• Enterprise quality
• Support tools for our product offerings
22. Optimizing Ceph for the All-Flash Future
Ceph was optimized for HDD; both tuning AND algorithm changes are needed for flash optimization.
• Quickly determined that the OSD was the major bottleneck: an OSD maxed out at about 1000 IOPS on the fastest CPUs (using ~4.5 cores)
• Examined and rejected running multiple OSDs per SSD: the failure-domain / CRUSH rules would be a nightmare
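For context on the failure-domain concern: CRUSH placement rules spread replicas across a chosen bucket type, and the standard (pre-Luminous) rule below separates copies by host. With several OSDs sharing one SSD, the bucket hierarchy and every rule would have to be reworked so that replicas never land on the same physical device. This is a generic example, not a rule from the deck:

```
rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
```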
23. SanDisk: OSD Read Path Optimization
Context switches matter at flash rates:
• Too much "put it in a queue for another thread" and lock contention
Socket handling matters too:
• Too many "get 1 byte" calls to the kernel for sockets
• Disable Nagle's algorithm to shorten operation latency
Lots of other simple things:
• Eliminate repeated look-ups, redundant string copies, etc.
Contributed improvements to the Emperor, Firefly and Giant releases. Now obtaining >200K IOPS per OSD using around 12 CPU cores per OSD (Jewel).
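The Nagle fix mentioned above is a one-line socket option. Ceph does this in its C++ messenger; a minimal Python sketch of the same idea:

```python
import socket

def make_low_latency(sock: socket.socket) -> socket.socket:
    """Disable Nagle's algorithm so small messages are sent immediately
    instead of being coalesced while the stack waits for ACKs."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock

s = make_low_latency(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
# The option reads back as non-zero once set.
assert s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
s.close()
```

The trade-off is more small packets on the wire, which is exactly what a latency-sensitive storage messenger wants.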
24. SanDisk: OSD Write Path Optimization
The write path strategy was classic HDD:
• Journal writes for minimum foreground latency
• Process the journal in batches in the background
This is inefficient for flash. The buffering/writing strategy was modified for flash (Jewel!):
• 2.5x write throughput, with average latency half that of Hammer
26. Test Configuration
• 2 x InfiniFlash systems, 256TB each
• 8 x OSD nodes, each with:
  o 2x E5-2697 v3 (14C, 2.6GHz), 8x 16GB DDR4 ECC 2133MHz
  o 1x Mellanox ConnectX-3 dual-port 40GbE
  o Ubuntu 14.04.02 LTS 64-bit
• 8–10 client nodes
• Ceph version: sndk-ifos-1.3.0.317, based on Ceph 10.2.1 (Jewel)
27. IFOS Block IOPS Performance
Highlights
• 4K numbers are CPU bound; an increase in server CPU will improve IOPS by ~11%
• At 64K and larger block sizes, bandwidth is close to raw box bandwidth
• 256K random read numbers can be increased further with more clients; able to achieve >90% drive saturation with 14 clients

Block size | Random Read IOPS | Read BW (GB/s) | Random Write IOPS | Write BW (GB/s)
4K         | 1,521,231        | 6              | 201,465           | 0.8
64K        | 347,628          | 22             | 55,648            | 3.5
256K       | 82,456           | 21             | 16,289            | 4.1

Write performance is on a 2x-copy configuration.
29. IFOS Block Workload Latency Performance
Environment
• librbd IO read latency measured on the Golden Config with 2-way replication at host level, 8 OSD nodes, IO duration 20 min
• fio read IO profile: 64K block, 2 numjobs, iodepth 16, 10 clients (each client with one RBD)
• fio write IO profile: 64K block, 2 numjobs, iodepth 16, 10 clients (each client with one RBD)

64K Random Read – average latency 1.7ms; 99% IOPS: 178,367
Latency range (µs) | Percentile
500       | 2.21
750       | 0.22
1,000     | 7.43
2,000     | 62.72
4,000     | 26.11
10,000    | 1.27
20,000    | 0.03
50,000    | 0.01
99 percent of the read IOs complete within 5ms.

64K Random Write – average latency 6.3ms; 99% IOPS: 227,936
Latency range (µs) | Percentile
1,000     | 0.33
2,000     | 43.16
4,000     | 28.11
10,000    | 21
20,000    | 3.31
50,000    | 2.17
100,000   | 1.35
250,000   | 0.4
500,000   | 0.11
750,000   | 0.04
1,000,000 | 0.01
2,000,000 | 0.01
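The fio profiles above can be written as a job file. The sketch below approximates the stated read profile; the pool and image names are placeholders, not values from the deck:

```ini
; Hypothetical fio job approximating the 64K librbd read profile
[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=testimg
direct=1
time_based=1
runtime=1200        ; 20 minutes, as in the test description

[rand-read-64k]
rw=randread
bs=64k
numjobs=2
iodepth=16
```

Each of the 10 clients would run one such job against its own RBD image.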
31. IFOS Object Performance
• Erasure coding provides the equivalent of 3x-replica storage with only 1.2x storage
• Object performance is on par with block performance
• Higher node counts = higher EC ratio = more storage savings
Configurations compared:
• Replication: OSD-level replication with 2 copies
• Erasure coding: node-level erasure coding with Cauchy-Good 6+2 (Cauchy-Good is better suited to InfiniFlash than Reed-Solomon)
[Chart: 4M object throughput (GB/s), erasure coding vs. replication – Repl (2x) read, EC (6+2) read, Repl (2x) write, EC (6+2) write]
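The storage-efficiency claim follows from simple chunk counting: a k+m erasure code stores k+m chunks for every k data chunks. For 6+2 that works out to 8/6 ≈ 1.33x raw capacity (the slide quotes 1.2x, presumably rounding or using different accounting), versus 3x for 3-way replication:

```python
def raw_multiplier(k: int, m: int) -> float:
    """Raw capacity consumed per byte of user data for a k+m erasure code."""
    return (k + m) / k

# 6+2 erasure coding stores 8 chunks for every 6 data chunks:
assert abs(raw_multiplier(6, 2) - 8 / 6) < 1e-12   # ~1.33x
# 3-way replication is the degenerate 1+2 case:
assert raw_multiplier(1, 2) == 3.0
```

The "higher node clusters = higher EC ratio" bullet is the same formula: with more nodes you can raise k relative to m, pushing the multiplier closer to 1.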
32. IF550 Reference Configurations

Small Block I/O workload:
• Small: 2x IF150, 128–256TB flash per enclosure using performance cards (4TB); 1 OSD server per 4–8 cards (dual E5-2680, 64GB RAM)
• Medium: 2x IF150, 128–256TB flash per enclosure using performance cards (4TB); 1 OSD server per 4–8 cards (dual E5-2687, 128GB RAM)
• Large: 2+ IF150, 128–256TB flash per enclosure using performance cards (4TB); 1 OSD server per 4–8 cards (dual E5-2697+, 128GB RAM)

Throughput workload:
• Small: 2x IF150, 128–512TB flash per enclosure; 1 OSD server per 16 cards (dual E5-2660, 64GB RAM)
• Medium: 2x IF150, 128–512TB flash per enclosure; 1 OSD server per 16 cards (dual E5-2680, 128GB RAM)
• Large: 2+ IF150, 128–512TB flash per enclosure; 1 OSD server per 16 cards (dual E5-2680, 128GB RAM)

Mixed workload:
• Small: 2x IF150, 128–512TB flash per enclosure; 1 OSD server per 8–16 cards (dual E5-2680, 64GB RAM)
• Medium: 2x IF150, 128–512TB flash per enclosure; 1 OSD server per 8–16 cards (dual E5-2690+, 128GB RAM)
• Large: 2+ IF150, 128–256TB flash per enclosure (optional performance cards); 1 OSD server per 8–16 cards (dual E5-2695+, 128GB RAM)
33. InfiniFlash TCO Advantage
• Reduce the replica count from 3 to 2
• Less compute, less HW and SW
• TCO analysis based on a US customer's OPEX & cost data for a 5PB deployment
[Charts: 5-year TCO (CAPEX + OPEX, $ millions), data center racks, and total energy cost ($ thousands), each comparing InfiniFlash, external AFA, DAS SSD node, and DAS 10K HDD node]
37. SanDisk: Potential Future Improvements
• RDMA intra-cluster communication: significant reduction in CPU per IOP
• BlueStore: significant reduction in write amplification → even higher write performance
• Memory allocation: tcmalloc/jemalloc/AsyncMessenger tuning shows up to 3x IOPS vs. default *
• Erasure coding for blocks (native)
* https://drive.google.com/file/d/0B2gTBZrkrnpZY3U3TUU3RkJVeVk/view
38. Time to Fix the Write Path Algorithm
Review of FileStore – what's wrong with FileStore:
• XFS + levelDB
• Missing transactional semantics for metadata and data
• Missing virtual-copy and merge semantics (the BTRFS implementation of these isn't general enough)
• Snapshot/rollback overhead too expensive for frequent use
• Transaction semantics aren't crash-proof
• Bad write amplification
• Bad jitter due to unpredictable file system behavior
• Bad CPU utilization; syncfs is VERY expensive
39. BlueStore
• One, two or three raw block devices: data, metadata/WAL, and KV journaling; when combined, no fixed partitioning is needed
• A single transactional KV store for all metadata: its semantics are well matched to ObjectStore transactions
• A raw block device for data storage: supports flash, PMR and SMR HDD
[Architecture diagram: client/gateway operations and peer-to-peer cluster management enter via the network interface and operation decoder; BlueStore implements ObjectStore over a KeyValueDB (metadata, hosted on BlueFS) plus raw data devices and a journal]
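The split the slide describes – object data straight to a raw device, all metadata in one transactional KV store – can be caricatured in a few lines of Python. This is a toy model to illustrate the architecture, not Ceph code:

```python
class ToyBlueStore:
    """Toy model of the BlueStore idea: data on a raw device,
    metadata in a single key-value store."""

    def __init__(self, device_size: int):
        self.device = bytearray(device_size)  # stands in for the raw block device
        self.kv = {}          # stands in for RocksDB/ZetaScale
        self.next_free = 0    # trivial bump allocator

    def write_object(self, name: str, data: bytes) -> None:
        off = self.next_free
        if off + len(data) > len(self.device):
            raise IOError("device full")
        # 1) Data goes directly to the raw device...
        self.device[off:off + len(data)] = data
        # 2) ...then one atomic metadata update makes the object visible.
        # In real BlueStore this is a KV transaction, which is what gives
        # ObjectStore its transactional semantics.
        self.kv[name] = (off, len(data))
        self.next_free = off + len(data)

    def read_object(self, name: str) -> bytes:
        off, length = self.kv[name]
        return bytes(self.device[off:off + length])

store = ToyBlueStore(1 << 20)
store.write_object("obj1", b"hello flash")
assert store.read_object("obj1") == b"hello flash"
```

The real system adds allocators, checksums and a WAL, but the key contrast with FileStore is visible even here: there is no file system between the OSD and its data.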
40. BlueStore vs FileStore
Test setup: 1x 800GB P3700 card (4 OSDs per card), 64GB RAM, 2x Intel Xeon E5-2650 v3 @ 2.30GHz, 1x Intel 40GbE link. Client fio processes and the mon ran on the same nodes as the OSDs.
41. KV Store Options
(Emerging Storage Solutions (EMS) – SanDisk Confidential)
RocksDB is a Facebook extension of levelDB:
• Log Structured Merge (LSM) based
• Ideal when metadata is on HDD
• The merge is effectively host-based GC when run on flash
ZetaScale™ from SanDisk®, now open sourced:
• B-tree based
• Ideal when metadata is on flash
• Uses device-based GC for maximum performance
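To make the "host-based GC" remark concrete, here is a minimal sketch of the LSM pattern used by the levelDB/RocksDB family: writes land in an in-memory memtable, which is flushed to sorted immutable runs; reads check the memtable, then runs from newest to oldest; the periodic merge of runs is rewrite work the host does itself, on top of the device's own flash GC. A deliberately tiny illustration, not a real KV store:

```python
class TinyLSM:
    """Minimal LSM-tree sketch: memtable + sorted immutable runs."""

    def __init__(self, memtable_limit: int = 4):
        self.memtable = {}
        self.runs = []                    # sorted (key, value) lists, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Flush the memtable as a new sorted, immutable run.
        self.runs.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def compact(self):
        # Merge all runs into one; newer runs win on duplicate keys.
        # This rewrite is the host-based "GC" the slide refers to.
        merged = {}
        for run in reversed(self.runs):   # oldest first, so newer overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())]

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:             # newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None

db = TinyLSM()
for i in range(10):
    db.put(f"k{i}", i)
db.put("k0", 99)                          # overwrite an old key
db.compact()
assert db.get("k0") == 99 and db.get("k7") == 7
```

A B-tree store like ZetaScale updates in place instead, so it avoids this host-side rewrite and leans on the device's GC – which is why the slide pairs it with flash-resident metadata.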
42. BlueStore ZetaScale vs RocksDB Performance
Test setup: 1 OSD, 8TB SAS SSD, 10GB RAM, Intel Xeon E5-2680 v2 @ 2.80GHz, fio, 32 threads, 64 iodepth, 6TB dataset, 30 min.

Random read/write 4K IOPS per OSD (thousands):
Read/Write ratio | BlueStore (RocksDB) | BlueStore (ZetaScale)
0/100            | 0.436               | 0.95
70/30            | 1.005               | 2.83
100/0            | 3.970               | 9.29
43. The InfiniFlash™ System ...
• Power: 70% less
• Speed: 40x faster than SAN
• Density: 10x higher
• Reliability: 20x better AFR
• Cost: up to 80% lower TCO
Video continues to drive the need for storage, and Point-Of-View cameras like GoPro are producing compelling high resolution videos on our performance cards. People using smartphones to make high resolution videos choose our performance mobile cards also, driving the need for higher capacities.
There is a growing customer base for us around the world, with one billion additional people joining the Global Middle Class between 2013 and 2020. These people will use smart mobile devices as their first choice to spend discretionary income on, and will expand their storage using removable cards and USB drives.
We are not standing still, but creating new product categories to allow people to expand and share their most cherished memories.
___________________________________________________________
Eg. If you’re using it for small blocks, you need more CPUs. However large objects can use less servers. Your choice on how you want to deploy it.
All these are listed as various ineffeciencies. Originally about 10K IOPS before doing all the optimisations
Ran a 1PB test on Hammer with 256TB scaling. Almost linearly scaling.
You can’t get these numbers easily elsewhere.
Typically latency is around 10ms for R and 20-40ms for W even on flash!
EC is a customer configurable option. A lot more writes with 3 copy.
SanDisk is working on block EC….right now just object
The point here is flash is about the same as HDD
FileStore – existing backend storage for CEPH, many deficits. BlueStore is the new architecture. This is a preview… Tech preview for the rest of this year. By the L release it will be production.
We will switch to KV store and get rid of the journal. A journal invokes too many writes. Almost 4.
KV Store will be RockDB but SD will introduce a flash optimised KV Store basedon ZetaScale.
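The "almost 4" figure can be sanity-checked with rough accounting of where a FileStore write lands on the device. The breakdown below is illustrative (the exact categories and counts vary with workload), not a measured decomposition from the talk:

```python
def filestore_device_writes(client_writes: int) -> int:
    """Rough, illustrative count of device writes per batch of client
    writes under a FileStore-style design."""
    journal = client_writes   # full-data write-ahead journal entry
    data = client_writes      # the write applied to the XFS file
    pg_log = client_writes    # PG log / levelDB metadata update
    inode = client_writes     # file metadata (inode/xattr) flushed by syncfs
    return journal + data + pg_log + inode

# ~4 device writes per client write, matching the "almost 4" remark.
assert filestore_device_writes(1000) == 4000
```

Eliminating the separate full-data journal is exactly why BlueStore's write amplification, and hence write performance, improves.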
Today's flash solutions and arrays can address most of these problems – they are low power, high performance, somewhat scalable (though not to tens of PBs) and highly reliable – but there is one thing that holds them back, something missing: favorable economics. Flash is simply way too expensive for at-scale workloads, putting it out of reach. So we went to work as a team. Our first investment was the best investment we ever made – a very clean sheet of paper! We knew that today's HDD-based storage solutions and today's all-flash arrays would not do the trick. We had to create something brand new that looks like nothing the world has seen.
Substantiation
Low power: 20 enclosures down to 2, at 100 watts per enclosure with 24 drives per enclosure (9W HDD, 7W SSD) = 93% power reduction, or about 1/16 the power (from his TCO calculator).
HDD: 480 drives at 9W (4,320W) + 20 enclosures at 100W (2,000W) = 6,320W
SSD: 46 SSDs at 7W (176TB) + 2 enclosures at 100W = 536W
Extreme performance: 30x faster NoSQL transactions (MongoDB solution brief)
Scalable: 4,500 virtual desktops in one rack (Fusion ioVDI and VMware Horizon View: Reference Architecture for VDI)
Reliable: Accelerate Oracle Backup Using SanDisk Solid State Drives (SSDs)
Breakthrough economics: ~3x faster Hadoop jobs with half the servers (Increasing Hadoop Performance with SanDisk SSDs)