Ceph on All-Flash Storage –
Breaking Performance Barriers
Zhou Hao
Technical Marketing Engineer
June 6th, 2015
Forward-Looking Statements
During our meeting today we will make forward-looking statements.
Any statement that refers to expectations, projections or other characterizations of future events or
circumstances is a forward-looking statement, including those relating to market growth, industry
trends, future products, product performance and product capabilities. This presentation also
contains forward-looking statements attributed to third parties, which reflect their projections as of the
date of issuance.
Actual results may differ materially from those expressed in these forward-looking statements due
to a number of risks and uncertainties, including the factors detailed under the caption “Risk Factors”
and elsewhere in the documents we file from time to time with the SEC, including our annual and
quarterly reports.
We undertake no obligation to update these forward-looking statements, which speak only as
of the date hereof or as of the date of issuance by a third party, as the case may be.
Requirements from Big Data @ PB Scale

CONTENT REPOSITORIES
• Mixed media containers, active archiving, backup, locality of data
• Large containers with application SLAs

BIG DATA ANALYTICS
• Internet of Things, sensor analytics
• Time-to-Value and Time-to-Insight
• Hadoop
• NoSQL: Cassandra, MongoDB

MEDIA SERVICES
• High read-intensive access from billions of edge devices
• Hi-def video driving even greater demand for capacity and performance
• Surveillance systems, analytics
InfiniFlash™ System
IF500 with InfiniFlash OS (Ceph)
• Ultra-dense all-flash appliance
  - 512TB in 3U
• Scale-out software for massive capacity
  - Unified content: block, object
  - Flash-optimized software with programmable interfaces (SDK)
• Enterprise-class storage features
  - Snapshots, replication, thin provisioning
• Enhanced performance for block and object
  - 10x improvement for block reads
  - 2x improvement for object reads
Ideal for large-scale storage and best-in-class $/IOPS/TB
InfiniFlash Hardware System
Capacity: 512TB* raw
• All-flash 3U storage system
• 64 x 8TB flash cards with Pfail (power-fail) protection
• 8 SAS ports total
Operational efficiency and resilience
• Hot-swappable components, easy FRU
• Low power: 450W (avg), 750W (active)
• MTBF 1.5+ million hours
Scalable performance**
• 780K IOPS
• 7GB/s throughput
• Upgrade to 12GB/s in Q3 2015
* 1TB = 1,000,000,000,000 bytes. Actual user capacity less.
** Based on internal testing of InfiniFlash 100. Test report available.
Innovating Performance @ InfiniFlash OS
• Major improvements to enhance parallelism
• Backend optimizations – XFS and flash
• Messenger performance enhancements
• Message signing
• Socket read-aheads
• Resolved severe lock contentions
• Reduced CPU usage by ~2 cores through improved file-path resolution from object ID
• CPU- and lock-optimized fast path for reads
• Disabled throttling for flash
• Index Manager caching and shared FdCache in FileStore
• Removed single dispatch-queue bottlenecks in the OSD and client (librados) layers
• Shared thread-pool implementation
• Major lock reordering
• Improved lock granularity – reader/writer locks (see the sketch after this list)
• Granular locks at the object level
• Optimized OpTracking path in the OSD, eliminating redundant locks
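To make the reader/writer-lock bullet concrete, here is a minimal C++ sketch of the pattern, assuming a hypothetical shared FD cache keyed by object ID; it illustrates the technique only and is not Ceph source:

```cpp
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

// Illustrative object index protected by a reader/writer lock.
// Many readers can resolve object IDs concurrently; writers still
// take the lock exclusively. Names and structure are hypothetical.
class ObjectIndex {
public:
    // Read fast path: shared (reader) lock only, so reads no longer
    // serialize behind each other.
    bool lookup(const std::string& oid, int& fd) const {
        std::shared_lock<std::shared_mutex> rl(map_lock_);
        auto it = fd_cache_.find(oid);
        if (it == fd_cache_.end()) return false;
        fd = it->second;
        return true;
    }

    // Write path: exclusive (writer) lock, taken only when the cache changes.
    void insert(const std::string& oid, int fd) {
        std::unique_lock<std::shared_mutex> wl(map_lock_);
        fd_cache_[oid] = fd;
    }

private:
    mutable std::shared_mutex map_lock_;   // was a plain mutex in the coarse-grained version
    std::map<std::string, int> fd_cache_;  // shared FD cache keyed by object ID
};
```

Applied at the object level inside the OSD and FileStore paths, this is the kind of change the list above describes.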
Open Source with SanDisk Advantage
InfiniFlash OS – Enterprise-Level Hardened Ceph

Enterprise-Level Hardening
• 9,000 hours of cumulative IO tests
• 1,100+ unique test cases
• 1,000 hours of cluster-rebalancing tests
• 1,000 hours of IO over iSCSI

Testing at Hyperscale
• Clusters of over 100 server nodes
• Over 4PB of flash storage

Failure Testing
• 2,000 node reboot cycles
• 1,000 abrupt node power cycles
• 1,000 storage failures
• 1,000 network failures
• IO runs of 250 hours at a stretch

Enterprise-Level Support
• Enterprise-class support and services from SanDisk
• Risk mitigation through long-term support and a reliable long-term roadmap
• Continual contributions back to the community
Test Configuration – Single InfiniFlash System
Performance improves 2x to 12x depending on the Block size
Performance Improvement: Stock Ceph vs. IF OS
8K Random Blocks
[Charts: IOPS and average latency (ms) for Stock Ceph (Giant) vs. IFOS 1.0, swept over queue depths of 1/4/16 and read percentages of 0/25/50/75/100]
• 2 RBD/client x 4 clients total (an example fio job is sketched below)
• 1 InfiniFlash node with 512TB
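The deck does not name the benchmark tool, but sweeps like this are commonly driven with fio's RBD engine. A minimal sketch of a single cell of the matrix above (8K random IO, 75% reads, queue depth 16); the cephx user, pool, and image names are assumptions:

```ini
; Hypothetical fio job reproducing one cell of the sweep:
; 8K random IO, 75% reads, queue depth 16, against an RBD image.
[global]
ioengine=rbd
clientname=admin       ; cephx user (assumed)
pool=rbdpool           ; pool name (assumed)
rbdname=fio-test       ; image name (assumed)
invalidate=0
rw=randrw
rwmixread=75
bs=8k
iodepth=16
runtime=300
time_based=1

[rbd-8k-qd16-75read]
```

Each client host would run two such jobs (one per RBD image), and block size, queue depth, and read mix would be swept to fill in the matrices shown here.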
Performance Improvement: Stock Ceph vs. IF OS
64K Random Blocks
[Charts: IOPS and average latency (ms) for Stock Ceph vs. IFOS 1.0, swept over queue depths of 1/4/16 and read percentages of 0/25/50/75/100]
• 2 RBD/client x 4 clients total
• 1 InfiniFlash node with 512TB
Performance Improvement: Stock Ceph vs. IF OS
256K Random Blocks
[Charts: IOPS and average latency (ms) for Stock Ceph vs. IFOS 1.0, swept over queue depths of 1/4/16 and read percentages of 0/25/50/75/100]
• 2 RBD/client x 4 clients total
• 1 InfiniFlash node with 512TB
Test Configuration – 3 InfiniFlash Systems (128TB each)
Performance scales linearly with additional InfiniFlash nodes
Scaling with Performance
8K Random Blocks
[Charts: IOPS and average latency (ms), swept over queue depths of 1/8/64 and read percentages of 0/25/50/75/100]
• 2 RBD/client x 5 clients
• 3 InfiniFlash nodes with 128TB each
Scaling with Performance
64K Random Blocks
[Charts: IOPS and average latency (ms), swept over queue depth and read percentages of 0/25/50/75/100]
• 2 RBD/client x 5 clients
• 3 InfiniFlash nodes with 128TB each
Scaling with Performance
256K Random Blocks
[Charts: IOPS and average latency (ms), swept over queue depth and read percentages of 0/25/50/75/100]
• 2 RBD/client x 5 clients
• 3 InfiniFlash nodes with 128TB each
Flexible Ceph Topology with InfiniFlash
[Diagram: InfiniFlash enclosures (HSEB A/B) connected over SAS to OSD nodes in the storage farm; client applications in the compute farm consume LUNs through RBDs/RGW and SCSI targets; read and write IO paths are shown separately (a client-side sketch follows the bullets below)]
• Disaggregated architecture
• Optimized for performance
• Higher utilization
• Reduced costs
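As a sketch of the compute-farm side of this topology (pool and image names are hypothetical), a client node could create and map an RBD image, and the resulting block device could then be exported as a LUN by the node's SCSI target stack:

```sh
# Create a 100 GiB image; --size is given in MB on Ceph releases of this era.
rbd create --pool rbdpool --size 102400 lun01

# Map it on the compute node; a /dev/rbdN block device appears, which a
# SCSI target (e.g. LIO) can then export to client applications as a LUN.
rbd map rbdpool/lun01
```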
Flash + HDD with Data Tiering
Flash Performance with the TCO of HDD
• InfiniFlash OS performs automatic data placement and data movement between tiers, transparent to applications (a stock-Ceph approximation is sketched below)
• User-defined policies govern data placement on tiers
• Can be combined with erasure coding to further reduce TCO
Benefits
• Flash-based performance with HDD-like TCO
• Lower performance requirements on the HDD tier enable denser, cheaper SMR drives
• Denser and lower power than an HDD-only solution
• InfiniFlash for high-activity data, SMR drives for low-activity data
• 60+ HDDs per server
[Diagram: compute farm served by an InfiniFlash tier and an HDD tier]
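InfiniFlash OS's policy-driven tiering is its own feature, but stock Ceph's cache-tiering commands give a feel for the flash-over-HDD layering described above. A minimal sketch, assuming a flash-backed pool named flash-pool in front of an HDD-backed pool named hdd-pool (both names hypothetical):

```sh
# Attach the flash pool as a cache tier in front of the HDD base pool.
ceph osd tier add hdd-pool flash-pool

# Serve hot IO from flash and flush/evict cold data to the HDD tier.
ceph osd tier cache-mode flash-pool writeback

# Route client IO addressed to hdd-pool through the flash tier.
ceph osd tier set-overlay hdd-pool flash-pool
```

The HDD base pool could additionally be created as an erasure-coded pool, matching the erasure-coding point above.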
Flash Primary + HDD Replicas
Flash Performance with the TCO of HDD
[Diagram: compute farm reading from the primary replica on InfiniFlash, with an HDD-based data node holding the 2nd local replica and another HDD-based data node holding the 3rd DR replica]
• Higher affinity of the primary replica ensures most of the compute runs against InfiniFlash data (see the sketch below)
• 2nd and 3rd replicas on HDDs are primarily for data protection
• The high throughput of InfiniFlash handles data protection and movement for all replicas without impacting application IO
• Eliminates the cascade data-propagation requirement for HDD replicas
• Flash-accelerated object performance for replica 1 allows denser, cheaper SMR HDDs for replicas 2 and 3
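Stock Ceph can approximate this primary-on-flash layout with CRUSH rules plus primary affinity. A hedged sketch follows, with OSD IDs purely illustrative (flash OSDs 0-1, HDD OSDs 10-11):

```sh
# Allow primary affinity to take effect (required on Ceph releases of this era).
ceph tell mon.* injectargs '--mon_osd_allow_primary_affinity=true'

# Prefer the flash-backed OSDs as primaries for their PGs...
ceph osd primary-affinity osd.0 1.0
ceph osd primary-affinity osd.1 1.0

# ...and make the HDD-backed OSDs unlikely to be selected as primary.
ceph osd primary-affinity osd.10 0.0
ceph osd primary-affinity osd.11 0.0
```

Reads then land on the flash primaries, while the HDD replicas exist mainly for protection, as described above.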
TCO Example – Object Storage
Scale-out Flash Benefits at the TCO of HDD
[Chart: 3-year TCO (TCA plus 3-year opex, $ x10,000) and total racks for 96PB of object storage – traditional object store on HDD vs. InfiniFlash object store with 3 full replicas on flash vs. InfiniFlash with erasure coding (all flash) vs. InfiniFlash with flash primary and HDD copies]
• Weekly failure rate for a 100PB deployment: 15-35 HDDs vs. 1 InfiniFlash card
• HDDs cannot handle simultaneous egress/ingress
• Long HDD rebuild times, multiple failures, and data rebalancing lead to service disruption
• Flash provides guaranteed, consistent SLAs
• Flash capacity utilization far exceeds HDD's due to reliability and operational factors
• Low flash power consumption: 450W (avg), 750W (active)
Note that operational/maintenance costs and performance benefits are not accounted for in these models.
InfiniFlash™ System
The First All-Flash Storage System Built for High Performance Ceph
© 2015 SanDisk Corporation. All rights reserved. SanDisk is a trademark of SanDisk Corporation, registered in the United States and other countries. InfiniFlash is a trademark of SanDisk Enterprise IP LLC. All other product and company names are used for identification purposes and may be trademarks of their respective holder(s).
http://bigdataflash.sandisk.com/infiniflash
Steven.Xi@SanDisk.com – Sales
Tonny.Ai@SanDisk.com – Sales Engineering
Hao.Zhou@SanDisk.com – Technical Marketing
Venkat.Kolli@SanDisk.com – Product Management