Audience: Beginner to Intermediate
About: Overall increases in CPU and DRAM processing power are falling behind the massive acceleration in available storage and network bandwidth. Storage management services are emerging as a serious bottleneck. What does this imply for the datacenter of the future? How will it affect the physical network and storage topologies? And how will storage software need to change to meet these new realities?
Speaker Bio: Allen joined SanDisk in 2013 as an Engineering Fellow and is responsible for directing software development for SanDisk’s system level products. He has previously served as Chief Architect at Weitek Corp. and Citrix, and founded several companies including AMKAR Consulting, Orbital Data Corporation, and Cirtas Systems. Allen has a Bachelor of Science in Electrical Engineering from Rice University.
OpenStack Australia Day - Sydney 2016
https://events.aptira.com/openstack-australia-day-sydney-2016
The Consequences of Infinite Storage Bandwidth: Allen Samuels, SanDisk
1. May 5, 2016 1
Allen Samuels
The Consequences of Infinite Storage Bandwidth
Engineering Fellow, Systems and Software Solutions
May 5, 2016
2. May 5, 2016 2
Disclaimer
During the presentation today, we may make forward-looking statements.
Any statement that refers to expectations, projections, or other characterizations of future events or circumstances is a forward-
looking statement, including those relating to industry predictions and trends, future products and their projected availability,
and evolution of product capacities. Actual results may differ materially from those expressed in these forward-looking
statements due to a number of risks and uncertainties, including among others: industry predictions may not occur as expected,
products may not become available as expected, and products may not evolve as expected; and the factors detailed under the
caption “Risk Factors” and elsewhere in the documents we file from time to time with the SEC, including, but not limited to, our
annual report on Form 10-K for the year ended January 3, 2016. This presentation contains information from third parties,
which reflect their projections as of the date of issuance. We undertake no obligation to update these forward-looking
statements, which speak only as of the date hereof or the date of issuance by a third party.
3. May 5, 2016 3
What Do I Mean By Infinite Bandwidth?
4. May 5, 2016 4
Log scale
• Use DRAM Bandwidth as a proxy for CPU throughput
• Reasonable approximation for DMA heavy, and/or poor cache hit performance workloads (e.g. Storage)
Big difference in slope!
Data is for informational purposes only and may contain errors
Network, Storage and DRAM Trends
5. May 5, 2016 5
Linear scale
Infinite Storage Bandwidth
• Same data as last slide, but for the Log-impaired
• Storage Bandwidth is not literally infinite
• But the ratio of Network and Storage to CPU throughput is widening very quickly
Data is for informational purposes only and may contain errors
Network, Storage and DRAM Trends
6. May 5, 2016 6
[Chart: SSDs / CPU Socket by Year, 1990 to 2025; y-axis 0 to 250]
Data is for informational purposes only and may contain errors
7. May 5, 2016 7
[Chart: SSDs / CPU Socket @ 20% Max BW by Year, 1995 to 2025; y-axis 0 to 50]
Data is for informational purposes only and may contain errors
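One way to read the "SSDs / CPU Socket" charts is as a simple ratio: how many SSDs it takes to consume a given share of one socket's DRAM bandwidth. The Python sketch below assumes that reading. The function name and all bandwidth figures are hypothetical placeholders, not values taken from the charts.

```python
def ssds_per_socket(dram_bw_gbs: float, ssd_bw_gbs: float,
                    usable_fraction: float = 1.0) -> float:
    """SSDs whose combined bandwidth equals the usable share of DRAM bandwidth."""
    if ssd_bw_gbs <= 0:
        raise ValueError("ssd_bw_gbs must be positive")
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    return dram_bw_gbs * usable_fraction / ssd_bw_gbs


# Hypothetical figures for illustration only (not from the slides).
older = ssds_per_socket(dram_bw_gbs=50.0, ssd_bw_gbs=0.5, usable_fraction=0.2)
newer = ssds_per_socket(dram_bw_gbs=100.0, ssd_bw_gbs=4.0, usable_fraction=0.2)
print(f"older generation: {older:.1f} SSDs per socket")
print(f"newer generation: {newer:.1f} SSDs per socket")
```

With these placeholder inputs the ratio falls even though DRAM bandwidth doubles, because per-SSD bandwidth grows faster. That is the shape of the trend the charts describe.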
8. May 5, 2016 8
What happens as we get closer to the limit?
9. May 5, 2016 9
New Denser Server Form Factors
– Blades
– Sleds
Good short term solutions
Let’s Get Small!
10. May 5, 2016 10
Storage Cost = Media + Access + Management
Shared nothing architecture conflates access and management
Storage costs will become dominated by Management cost
Storage costs become CPU/DRAM costs
Effects Of The CPU/DRAM Bottleneck
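The cost identity on this slide can be sketched numerically. The Python illustration below uses hypothetical per-TB figures (not from the talk) to show how the management share grows when media and access costs fall while the CPU/DRAM cost of management stays flat.

```python
from dataclasses import dataclass


@dataclass
class StorageCost:
    """Storage Cost = Media + Access + Management (units are arbitrary)."""
    media: float
    access: float
    management: float

    @property
    def total(self) -> float:
        return self.media + self.access + self.management

    @property
    def management_share(self) -> float:
        return self.management / self.total


# Hypothetical figures for illustration only.
today = StorageCost(media=60.0, access=20.0, management=20.0)
later = StorageCost(media=15.0, access=5.0, management=20.0)

for label, cost in (("today", today), ("later", later)):
    print(f"{label}: total={cost.total:.0f}, management share={cost.management_share:.0%}")
```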
11. May 5, 2016 11
Move management to upper layers where CPU can be right-sized by client
What kind of media access do I want?
– Simple enough functionality to be done directly in drive hardware – NO CPU
– Allow direct access throughout the compute cluster over a network
– Just enough machinery to enable coarse-grained sharing
Embracing The CPU/DRAM Bottleneck
In short, you really want a SAN!
– Or more technically, Fabric Connected Storage
12. May 5, 2016 12
Not Your Father’s SAN
Three problems with current SAN
– Fibre channel transport
– SCSI access protocol
– Drive oriented storage allocation
All of these want to be updated
– Fibre channel is brittle and costly
– SCSI initiators have long code paths catering to seldom used configurations
– Drive oriented allocation should give way to robust sub-drive storage allocation
13. May 5, 2016 13
SAN 2.0
NVMe over Fabrics
1.0 Spec is out for review, hopefully done in May
Simple enough for direct hardware execution of data path ops
Minimal initiator code path lengths improve performance
Namespaces allow sub-drive allocations
Not mature enough for enterprise deployment – yet
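To make "namespaces allow sub-drive allocations" concrete, here is a toy Python model of carving one drive into independently addressable namespaces. It is only a sketch of the idea: the `Drive` class and its methods are hypothetical and do not represent the NVMe command set or any real driver API.

```python
class Drive:
    """Toy model: a drive's block range carved into namespaces (hypothetical API)."""

    def __init__(self, capacity_blocks: int) -> None:
        self.capacity_blocks = capacity_blocks
        self.namespaces = {}   # nsid -> (start_block, size_blocks)
        self._next_nsid = 1    # NVMe namespace IDs start at 1
        self._next_free = 0    # simple bump allocator, no reclamation

    def create_namespace(self, size_blocks: int) -> int:
        if size_blocks <= 0:
            raise ValueError("size_blocks must be positive")
        if self._next_free + size_blocks > self.capacity_blocks:
            raise ValueError("not enough unallocated capacity")
        nsid = self._next_nsid
        self.namespaces[nsid] = (self._next_free, size_blocks)
        self._next_free += size_blocks
        self._next_nsid += 1
        return nsid

    def free_blocks(self) -> int:
        return self.capacity_blocks - self._next_free


drive = Drive(capacity_blocks=1000)
ns_a = drive.create_namespace(400)   # e.g. handed to one client
ns_b = drive.create_namespace(500)   # e.g. handed to another client
print(drive.namespaces, "free:", drive.free_blocks())
```

Each client sees only its own namespace, so a single drive can be shared at coarse granularity without a CPU mediating every I/O.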
14. May 5, 2016 14
SAN 2.0
What storage network?
– Current candidates are FC, InfiniBand and Ethernet
Ethernet has best economics – if you can make it work
RoCE is easy on the edge, but hard on the interior
– Only controlled environments have shown multi-switch scalability
– General scalability in a multi-vendor environment likely to be difficult
– Wonderful for intra-rack storage networking
iWARP is hard on the edge, but easy on the interior
– Scarcity of implementations inhibits deployment
Storage over IP will see limited cross rack deployment until this is resolved
15. May 5, 2016 15
Implementations using OTS stuff are in progress
Server side implementations look pretty conventional too
4-5 MIOPS have been shown
Seems like 10 MIOPS isn’t unreasonable to expect
First Generation Of SAN 2.0
[Diagram labels: NIC, CPU, DRAM, SSD, PCIe]
16. May 5, 2016 16
Soon, NICs will forward NVMe operations to local PCIe devices
CPU removed from the software part of the data path
CPU is still needed for the hardware part of the data path
IOPS improve, BW is unchanged
Significant CPU freed for application processing
Getting closer to the wall!
Second Generation SAN 2.0
17. May 5, 2016 17
New generation of combined SSD controller and NIC
– Rethink of interfaces eliminates DRAM buffering
Network goes right into the drive
No CPU to be found
Works well with rack scale architecture
Third Generation SAN 2.0, Imagined
18. May 5, 2016 18
Disaggregated / Rack Scale Architecture
– Fabric connected
– Independently scale compute, networking and storage
Let’s Get Really Small
19. May 5, 2016 19
Call To Action
Fabric-connected storage isn’t well managed by existing FOSS
Lots of upper layer management software is available
– OpenStack, Ceph, Gluster, Cassandra, MongoDB, SheepDog, etc.
Lower layer cluster management still primitive
20. May 5, 2016 20
What’s It All Mean?
New form factors are in everybody's future
The coming avalanche of storage bandwidth wants to be free
– Not imprisoned by a CPU
Rack Scale Architecture allows new Storage/Compute configs
Storage will be increasingly “Software Defined” as the HW evolves
22. May 5, 2016 22
Old Model
Monolithic, large upfront
investments, and fork-lift upgrades
Proprietary storage OS
Costly: $$$$$
New SD-AFS Model
Disaggregate storage, compute, and software for
better scaling and costs
Best-in-class solution components
Open source software - no vendor lock-in
Cost-efficient: $
Software-defined All-Flash Storage
The disaggregated model for scale
23. May 5, 2016 23
Scalable Raw Performance
2M IOPS, Latency 1-3ms
12-15 GB/s Throughput
8TB Flash-Card Innovations
• Enterprise Grade Power-Fail Safe
• Alerts & monitoring
• Latching integrated & monitored
• Directly samples air temp
• Form-factor enables lowest cost SSD
InfiniFlash™ Storage Platform
Capacity 512TB – raw all Flash!
All Flash 3U JBOD of Flash (JBOF)
Up to 64 x 8TB SAS Drive Cards
4TB cards also available soon
Operational Efficiency & Resilience
Hot Swappable Architecture, Easy FRU
Low power – typical workload 400-500W
150W(idle) - 750W(max)
MTBF 1.5+ million hours
Hot Swappable!
Fans, SAS Expander Boards,
Power Supplies, Flash cards
Host Connectivity
Connect up to 8 servers
through 8 SAS ports
Multi-path enabled
Flash Drive Card
EMS Product Management SanDisk Confidential
24. May 5, 2016 24
InfiniFlash IF500 All-Flash Storage System
Block and Object Storage Powered by Ceph
Ultra-dense High Capacity Flash storage
– 512TB in 3U, Scale-out software for PB scale capacity
Highly scalable performance
– Industry leading IOPS/TB
Cinder, Glance and Swift storage
– Add/remove server & capacity on-demand
Enterprise-Class storage features
– Automatic rebalancing
– Hot Software upgrade
– Snapshots, replication, thin provisioning
– Fully hot swappable, redundant
Ceph Optimized for SanDisk flash
– Tuned & Hardened for InfiniFlash
25. May 5, 2016 25
InfiniFlash SW + HW Advantage
Software Storage System
Software tuned for
Hardware
• Ceph modifications for Flash
• Both Ceph and Host OS tuned for InfiniFlash
• SW defects that impact Flash identified & mitigated
Hardware Configured
for Software
• Right balance of CPU, RAM,
Storage
• Rack level designs for optimal
performance & cost
Software designed for all
systems does not work well with
any system
Ceph has over 50 tuning parameters that result in 5x – 6x performance improvement
Fixed CPU, RAM hyperconverged nodes do not work well for all workloads
26. May 5, 2016 26
InfiniFlash for OpenStack with Dis-Aggregation
Compute & Storage Disaggregation enables
Optimal Resource utilization
Allows for more CPU usage required for OSDs with
small Block workloads
Allows for higher bandwidth provisioning as required
for large Object workload
Independent Scaling of Compute and
Storage
Higher Storage capacity needs don't force you to add more compute, and vice versa
Leads to optimal ROI for PB scale OpenStack deployments
[Diagram: a Compute Farm (iSCSI Storage LUNs with KRBD and iSCSI Target; Swift ObjectStore objects with RGW and WebServer; Nova with Cinder & Glance LUNs with LibRBD and QEMU/KVM) above a Storage Farm of OSD nodes, attached over SAS to InfiniFlash units with HSEB A / HSEB B]
27. May 5, 2016 27
IF500 - Enhancing Ceph for Enterprise Consumption
IF500 provides usability and performance utilities without sacrificing Open Source principles
• SanDisk Ceph Distro ensures packaging with stable, production-ready code with consistent quality
• All Ceph Performance improvements developed by SanDisk are contributed back to community
SanDisk
Distribution or
Community
Distribution
Out-of-the-box configurations tuned for performance with Flash
Sizing & planning tool
InfiniFlash drive
management integrated
into Ceph management
(Coming Soon)
Ceph installer that is specifically built for InfiniFlash
High performance iSCSI storage
Better diagnostics with log collection tool
Enterprise hardened SW + HW QA
28. May 5, 2016 28
InfiniFlash Performance Advantage
900K Random Read Performance with 384TB of storage
Flash Performance unleashed
• Out-of-the-box configurations tuned for performance with Flash
• Read & Write data-path changes for Flash
• 3x-12x block performance improvement, depending on workload
• Almost linear performance scaling with addition of InfiniFlash nodes
• Write performance WIP with NV-RAM Journals
• Measured with 3 InfiniFlash nodes with 128TB each
• Avg latency with 4K blocks is ~2ms; 99.9th percentile latency is under 10ms
• For lower block sizes, performance is CPU bound at the Storage Node
• Maximum bandwidth of 12.2GB/s measured with 64KB blocks
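The throughput and latency figures on this slide can be related through Little's Law (requests in flight = throughput x mean latency). The Python sketch below assumes the headline 900K figure is IOPS and that it was measured at the same operating point as the ~2ms average latency. The slide does not state either explicitly.

```python
def outstanding_ios(iops: float, avg_latency_s: float) -> float:
    """Little's Law: mean number of I/Os in flight = throughput x mean latency."""
    return iops * avg_latency_s


# Assumed inputs: 900K IOPS and ~2 ms average latency, taken from the slide text.
in_flight = outstanding_ios(iops=900_000, avg_latency_s=0.002)
print(f"about {in_flight:.0f} I/Os in flight across the cluster")
```

Under those assumptions the cluster sustains roughly 1,800 concurrent I/Os, which gives a feel for the queue depth the clients need to generate to reach the quoted number.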
29. May 5, 2016 29
InfiniFlash Ceph Performance Advantage
Single InfiniFlash unit Performance
– 1 x 512TB InfiniFlash unit connected with 8 nodes
– 4K RR IOPS: ~1 million IOPs - 85% of bare metal perf.
• Corresponding Bare metal IF100 IOPS is 1.1 million
– All 8 hosts CPU saturated for 4K Random read.
• More performance potential with higher CPU cycles
– With 64k IO size we are able to utilize full IF150
bandwidth of over 12GB/s.
– Librbd and Krbd performance are comparable.
– Write Performance is on 3x copy configuration. The
more common 2x copy will result in 33% improvement.
Random Write
IO Profile LIBRBD IOPs
4k Random Write 54k
64k Random Write 34k
256k Random Write 11.3k
[Chart: Random Read Block Performance, showing LIBRBD IOPs (left axis, 0 to 1,200,000) and Bandwidth in GBps (right axis, 0 to 25) at 4k, 64k and 256k block sizes; data labels in chart order: 1,123,175, 349,247, 87,369]
30. May 5, 2016 30
InfiniFlash Ceph Performance Advantage
Linear Scaling with 2 InfiniFlash units
– 2 x 512TB InfiniFlash unit connected with 16 nodes
– 1.8M 4K IOPS – 80% of the bare metal performance
– Performance is scaling almost linearly: almost double the performance of a single IF150 with Ceph
– Write perf is 2x with a 16-node cluster compared with an 8-node cluster
Random Read
Random Write
IO Profile LIBRBD IOPs
4k RR 1800k
64k RR 225k
256k RR 53k
IO Profile LIBRBD BW(MB/s)
4k RR 7194
64k RR 14412
256k RR 13366
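The two tables above can be cross-checked against each other, since bandwidth is just IOPS times block size. The Python sketch below does that arithmetic, assuming "4k/64k/256k" mean KiB and that the reported MB/s are binary megabytes. The IOPS figures are rounded on the slide, so only approximate agreement is expected.

```python
def bandwidth_mib_s(iops: float, block_kib: float) -> float:
    """Bandwidth implied by an IOPS figure at a given block size (KiB -> MiB/s)."""
    return iops * block_kib / 1024


# block size (KiB) -> (reported IOPS, reported bandwidth in MB/s), from the tables above
reported = {
    4: (1_800_000, 7194),
    64: (225_000, 14412),
    256: (53_000, 13366),
}

deviations = {}
for block_kib, (iops, reported_mb_s) in reported.items():
    computed = bandwidth_mib_s(iops, block_kib)
    deviations[block_kib] = abs(computed - reported_mb_s) / reported_mb_s
    print(f"{block_kib}k RR: computed {computed:.1f} vs reported {reported_mb_s} "
          f"({deviations[block_kib]:.1%} apart)")
```

Under these unit assumptions the computed and reported figures agree to within about 3%, which is consistent with rounding in the IOPS column.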
31. May 5, 2016 31
InfiniFlash OS – Hardened Enterprise Class Ceph
Hardened and tested for Hyperscale
deployments and workloads
Platform focused testing enables us to deliver a
complete and hardened storage solution
Single Vendor support for both Hardware &
Software
Enterprise Level Hardening, Testing at Scale, Failure Testing:
• 9,000 hours of cumulative IO tests
• 1,100+ unique test cases
• 1,000 hours of Cluster Rebalancing tests
• 1,000 hours of IO on iSCSI
• Over 100 server node clusters
• Over 4PB of Flash Storage
• 2,000 Cycle Node Reboot
• 1,000 times Node Abrupt Power Cycle
• 1,000 times Storage Failure
• 1,000 times Network Failure
• IO for 250 hours at a stretch
32. May 5, 2016 32
IF500 Reference Configurations
Model | Entry | Mid | High
InfiniFlash | 128TB | 256TB | 512TB
Servers [1] | 2 x Dell R 630-2U | 4 x Dell R 630-2U | 4 x Dell R 630-2U [2]
Processor per server | Dual socket Intel Xeon E5-2690 v3 | Dual socket Intel Xeon E5-2690 v3 * | Dual socket Intel Xeon E5-2690 v3
Memory per server | 128GB RAM | 128GB RAM | 128GB RAM
HBA per server | (1) LSI 9300-8e PCIe 12Gbps | (1) LSI 9300-8e PCIe 12Gbps | (1) LSI 9300-8e PCIe 12Gbps
Network per server | (1) Mellanox ConnectX-3 dual ports 40GbE | (1) Mellanox ConnectX-3 dual ports 40GbE | (1) Mellanox ConnectX-3 dual ports 40GbE
Boot Drive per server | (2) SATA 120GB SSD | (2) SATA 120GB SSD | (2) SATA 120GB SSD
[1] For larger block workloads or less CPU intensive workloads, the OSD node could use a single socket server. Dell servers can be substituted with other vendor servers that match the specs.
[2] For small block workloads, 8 servers are recommended.
33. May 5, 2016 33
InfiniFlash TCO Advantage
[Chart: 3 year TCO comparison * (3 year Opex and TCA; y-axis $0 to $80,000,000) for four configurations: Traditional ObjStore on HDD; IF500 ObjStore w/ 3 Full Replicas on Flash; IF500 w/ EC - All Flash; IF500 - Flash Primary & HDD Copies]
[Chart: Total Rack (y-axis 0 to 100) for the same four configurations]
Reduce the replica count with higher reliability of flash
- 2 copies on InfiniFlash vs. 3 copies on HDD
InfiniFlash disaggregated architecture reduces compute usage, thereby reducing HW & SW costs
- Flash allows the use of erasure coded storage pool without performance limitations
- Protection equivalent of 2x storage with only 1.2x storage
Power, real estate, maintenance cost savings over 5 year TCO
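The "1.2x storage" figure follows from the usual overhead formula for erasure coding: with k data chunks and m coding chunks, raw capacity consumed per unit of user data is (k + m) / k. The Python sketch below compares that with full replication. The slide does not state which erasure-coding profile was used, so the k and m values shown are hypothetical examples that happen to yield 1.2x.

```python
def replication_overhead(copies: int) -> float:
    """Raw capacity consumed per unit of user data with full replicas."""
    if copies < 1:
        raise ValueError("copies must be >= 1")
    return float(copies)


def erasure_coding_overhead(k: int, m: int) -> float:
    """Raw capacity per unit of user data with k data chunks and m coding chunks."""
    if k < 1 or m < 0:
        raise ValueError("need k >= 1 and m >= 0")
    return (k + m) / k


print("3 replicas:", replication_overhead(3))
print("2 replicas:", replication_overhead(2))
# Hypothetical profiles; the talk does not say which one was used.
print("EC k=5,  m=1:", erasure_coding_overhead(5, 1))
print("EC k=10, m=2:", erasure_coding_overhead(10, 2))
```

Both example profiles consume 1.2x raw capacity. They differ in how many simultaneous chunk losses they tolerate (m), which is the trade-off to weigh against the replica counts on the slide.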
* TCO analysis based on a US customer’s OPEX & Cost data for a 100PB deployment